🤖 AI Summary
Existing evaluation frameworks for coreference resolution are ill-suited for large language models (LLMs), and multilingual LLM-based coreference resolution lacks standardized, fair benchmarks. Method: The authors introduce the first LLM-specific evaluation track for multilingual coreference resolution, using plain-text input only. They build a unified, equitable multilingual benchmark on CorefUD v1.3, a harmonized collection of 22 datasets in 17 languages, including three newly added datasets and two new languages. They systematically compare fine-tuned and few-shot LLM approaches against traditional systems. Contribution/Results: Of nine participating systems, four are LLM-based. Although traditional systems still lead overall, LLMs demonstrate strong competitive performance, validating their feasibility for cross-lingual coreference modeling. This work establishes a new paradigm and a reproducible multilingual benchmark for future LLM-driven coreference resolution research.
📝 Abstract
The paper presents an overview of the fourth edition of the Shared Task on Multilingual Coreference Resolution, organized as part of the CODI-CRAC 2025 workshop. As in the previous editions, participants were challenged to develop systems that identify mentions and cluster them according to identity coreference.
A key innovation of this year's task was the introduction of a dedicated Large Language Model (LLM) track, featuring a simplified plaintext format designed to be more suitable for LLMs than the original CoNLL-U representation.
The task also expanded its coverage with three new datasets in two additional languages, using version 1.3 of CorefUD, a harmonized multilingual collection of 22 datasets in 17 languages.
In total, nine systems participated, including four LLM-based approaches (two fine-tuned and two using few-shot adaptation). While traditional systems retained the lead, LLMs showed clear potential, suggesting they may challenge established approaches in future editions.