Towards Generating Automatic Anaphora Annotations

📅 2025-03-12
🤖 AI Summary
This work addresses the scarcity of high-quality coreference-annotated data and the prohibitive cost of manual annotation for coreference resolution. To this end, we propose a fully automated, dual-path data generation framework that requires no human labeling: (1) rule-based direct conversion leveraging existing corpora, and (2) cross-lingual joint parsing of dependency and coreference structures using multilingual pretrained models. Our framework is the first to systematically integrate structured data mapping with multilingual transfer capabilities, enabling coreference annotation for low-resource and unseen languages. Experiments across diverse languages demonstrate that the generated data achieves high quality and strong generalization, significantly reducing annotation costs. The approach establishes a scalable, reusable data infrastructure for coreference resolution—advancing both data efficiency and cross-lingual applicability in the field.

📝 Abstract
Training models that perform well on various NLP tasks requires large amounts of data, and this need is most apparent in nuanced tasks such as anaphora and coreference resolution. To combat the prohibitive cost of creating manual gold-annotated data, this paper explores two methods for automatically creating datasets with coreferential annotations: direct conversion from existing datasets, and parsing with multilingual models capable of handling new and unseen languages. The paper details the current progress on these two fronts, the challenges the efforts currently face, and our approach to overcoming those challenges.
Problem

Research questions and friction points this paper is trying to address.

Generating automatic anaphora annotations for NLP tasks
Reducing costs of manual gold annotated data creation
Handling new and unseen languages in coreference resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct conversion from existing datasets
Parsing using multilingual models
Handling new and unseen languages
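
To make the direct-conversion idea listed above more concrete, here is a minimal sketch of one step such a conversion might involve: turning pairwise anaphor-to-antecedent links, as some existing datasets store them, into entity clusters. This is not the authors' pipeline. The function name `links_to_clusters`, the `(sentence index, token index)` mention identifiers, and the sample links are all hypothetical, and the grouping uses a standard union-find structure chosen for illustration.

```python
from collections import defaultdict


def links_to_clusters(links):
    """Group pairwise (anaphor, antecedent) links into entity clusters.

    Hypothetical helper for illustration only; the input format is an
    assumption, not something taken from the paper.
    """
    parent = {}

    def find(mention):
        # Register unseen mentions as their own cluster root.
        parent.setdefault(mention, mention)
        while parent[mention] != mention:
            # Path halving: point the mention at its grandparent.
            parent[mention] = parent[parent[mention]]
            mention = parent[mention]
        return mention

    for anaphor, antecedent in links:
        root_a, root_b = find(anaphor), find(antecedent)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = defaultdict(list)
    for mention in parent:
        clusters[find(mention)].append(mention)
    # Sort for a deterministic, readable result.
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical mentions, identified as (sentence index, token index).
links = [((1, 0), (0, 2)), ((2, 4), (1, 0)), ((2, 7), (2, 1))]
clusters = links_to_clusters(links)
print(clusters)
```

Chained links, such as a pronoun pointing to another pronoun that in turn points to a noun phrase, end up in the same cluster, which is the shape cluster-based coreference formats typically expect.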
Dima Taji
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (ÚFAL), Prague, Czechia
Daniel Zeman
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University
Natural Language Processing, Morphology, Parsing, Meaning Representation