Finding errors and inconsistencies in the CorefUD coreference dataset

Dohnalová, Barbora

Hledání chyb a nekonzistencí v koreferenčním datasetu CorefUD

dc.contributor.advisor	Popel, Martin
dc.creator	Dohnalová, Barbora
dc.date.accessioned	2024-11-28T11:52:16Z
dc.date.available	2024-11-28T11:52:16Z
dc.date.issued	2024
dc.identifier.uri	http://hdl.handle.net/20.500.11956/192872
dc.description.abstract	Projekt CorefUD se snaží harmonizovat anotaci koreference napříč různými jazyky, po vzoru iniciativy Universal Dependencies. Od roku 2022 je CorefUD dataset využíván v CRAC Shared Task on Coreference Resolution, soutěži, ve které je cílem zúčastněných automaticky koreferenci anotovat. Jelikož jsou predikce zúčastněných systémů veřejně dostupné, rozhodli jsme se otestovat s jejich pomocí následující domněnku: Pokud se většina predikcí shodne, že by zlatá data měla být anotovaná jinak, potenciálně to ukazuje na chybu právě ve zlatých datech. Abychom ji ověřili, naprogramujeme PluCorAED, systém na detekci anotačních chyb, který klasifikuje chyby, které predikce udělaly, a posčítá je. Následně provedeme analýzu výsledků, abychom zjistili, pro které typy chyb je tento přístup vhodný. Nakonec shrneme chyby, které jsme našli a opravili v CorefUDu. Některé z oprav již byly zakomponovány do nejnovější verze.	cs_CZ
dc.description.abstract	The CorefUD project attempts to harmonise coreference annotation across different languages, in the spirit of the Universal Dependencies initiative. Since 2022, the CorefUD dataset has also been used in the CRAC Shared Task on Coreference Resolution, where participants try to annotate the data automatically. As the submission predictions are publicly available, we decided to test the following hypothesis: If most of the predictions agree that the gold annotation should be different, it might indicate an error in the gold data. To verify this, we build PluCorAED, an annotation error detection system, which classifies the errors the submissions made and aggregates them. Then we analyse the results to see which types of errors this approach might be suitable for. Finally, we give an overview of the errors we have found and corrected in CorefUD. Some of the corrections have already been incorporated into a new release.	en_US
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	coreference|annotation error detection|CorefUD	en_US
dc.subject	koreference|detekce chyb v anotaci|CorefUD	cs_CZ
dc.title	Finding errors and inconsistencies in the CorefUD coreference dataset	en_US
dc.type	bakalářská práce	cs_CZ
dcterms.created	2024
dcterms.dateAccepted	2024-09-05
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.identifier.repId	255447
dc.title.translated	Hledání chyb a nekonzistencí v koreferenčním datasetu CorefUD	cs_CZ
dc.contributor.referee	Novák, Michal
thesis.degree.name	Bc.
thesis.degree.level	bakalářské	cs_CZ
thesis.degree.discipline	Computer Science with specialisation in Foundations of Computer Science	en_US
thesis.degree.discipline	Informatika se specializací Obecná informatika	cs_CZ
thesis.degree.program	Computer Science	en_US
thesis.degree.program	Informatika	cs_CZ
uk.thesis.type	bakalářská práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Informatika se specializací Obecná informatika	cs_CZ
uk.degree-discipline.en	Computer Science with specialisation in Foundations of Computer Science	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.cs	Projekt CorefUD se snaží harmonizovat anotaci koreference napříč různými jazyky, po vzoru iniciativy Universal Dependencies. Od roku 2022 je CorefUD dataset využíván v CRAC Shared Task on Coreference Resolution, soutěži, ve které je cílem zúčastněných automaticky koreferenci anotovat. Jelikož jsou predikce zúčastněných systémů veřejně dostupné, rozhodli jsme se otestovat s jejich pomocí následující domněnku: Pokud se většina predikcí shodne, že by zlatá data měla být anotovaná jinak, potenciálně to ukazuje na chybu právě ve zlatých datech. Abychom ji ověřili, naprogramujeme PluCorAED, systém na detekci anotačních chyb, který klasifikuje chyby, které predikce udělaly, a posčítá je. Následně provedeme analýzu výsledků, abychom zjistili, pro které typy chyb je tento přístup vhodný. Nakonec shrneme chyby, které jsme našli a opravili v CorefUDu. Některé z oprav již byly zakomponovány do nejnovější verze.	cs_CZ
uk.abstract.en	The CorefUD project attempts to harmonise coreference annotation across different languages, in the spirit of the Universal Dependencies initiative. Since 2022, the CorefUD dataset has also been used in the CRAC Shared Task on Coreference Resolution, where participants try to annotate the data automatically. As the submission predictions are publicly available, we decided to test the following hypothesis: If most of the predictions agree that the gold annotation should be different, it might indicate an error in the gold data. To verify this, we build PluCorAED, an annotation error detection system, which classifies the errors the submissions made and aggregates them. Then we analyse the results to see which types of errors this approach might be suitable for. Finally, we give an overview of the errors we have found and corrected in CorefUD. Some of the corrections have already been incorporated into a new release.	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
thesis.grade.code	1
uk.publication-place	Praha	cs_CZ
uk.thesis.defenceStatus	O