Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-document coreference resolution (CDCR) research has been fragmented by heterogeneous datasets, inconsistent annotation standards, and an overemphasis on event coreference, which hinders reproducibility and fair comparison. This work proposes uCDCR, the first unified English CDCR dataset to jointly integrate entity and event coreference, built by standardizing annotations, correcting inconsistencies, completing missing attributes, and establishing a consistent evaluation framework. The curation pipeline employs standardized parsing, cross-annotation consistency checks, and lexical diversity analysis, supported by an open-source toolchain released via Hugging Face and GitHub. The analysis reveals that the widely used ECB+ benchmark exhibits one of the lowest lexical diversities among the unified corpora and that entity and event coreference present comparable levels of difficulty. Training and evaluating on uCDCR is shown to improve model generalization and to advance CDCR toward a more comprehensive and balanced paradigm.
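As an illustration of the lexical diversity analysis mentioned above, the core idea can be sketched as a type-token ratio over mention surface forms, computed overall and per coreference chain. The chain data and the exact metric below are illustrative assumptions; the paper's actual diversity and ambiguity metrics may be more elaborate.

```python
from collections import defaultdict


def mention_lexical_diversity(chains):
    """Type-token ratio of mention surface forms, overall and per chain.

    `chains` maps a coreference-chain id to the list of mention strings
    annotated for that chain across documents. A ratio near 1.0 means the
    chain is referred to with many distinct surface forms (high diversity);
    a ratio near 0 means the same string is repeated (low diversity).
    """
    all_mentions = [m.lower() for ms in chains.values() for m in ms]
    overall = len(set(all_mentions)) / len(all_mentions)
    per_chain = {
        cid: len({m.lower() for m in ms}) / len(ms)
        for cid, ms in chains.items()
    }
    return overall, per_chain


# Invented toy data: one lexically varied chain, one repetitive chain.
chains = {
    "quake_varied": ["earthquake", "quake", "the 6.1-magnitude tremor", "disaster"],
    "quake_repetitive": ["earthquake", "earthquake", "earthquake", "quake"],
}
overall, per_chain = mention_lexical_diversity(chains)
```

On this toy input, the varied chain scores 1.0 and the repetitive chain 0.5, mirroring the kind of per-dataset contrast the paper draws (e.g., ECB+ scoring low relative to other corpora).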

📝 Abstract
Research in CDCR remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominant framing of CDCR as event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora from various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches the datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, cross-dataset analysis in CDCR: we compare the datasets on their lexical properties (e.g., the lexical composition of the annotated mentions and lexical diversity and ambiguity metrics), discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence the performance of a same-head-lemma baseline. Our dataset analysis shows that ECB+, the most widely used CDCR benchmark, has one of the lowest lexical diversities, and that its CDCR complexity, as measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the nearly identical performance of the same-head-lemma baseline, applied separately to events and entities, shows that resolving both types is a complex task and that CDCR should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia-zhukova/uCDCR.
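The same-head-lemma baseline used in the abstract can be sketched in a few lines: mentions whose head words share a lemma are placed in one predicted cross-document chain. The last-token head heuristic and the tiny hand-written lemma table below are simplifying assumptions for a self-contained example; a real implementation would use a parser and lemmatizer (e.g., spaCy), and the paper's scoring setup may differ.

```python
from collections import defaultdict

# Stand-in lemma table; a real pipeline would use a proper lemmatizer.
LEMMAS = {
    "attacks": "attack", "attacked": "attack",
    "bombings": "bombing", "says": "say", "said": "say",
}


def head_lemma(mention):
    """Approximate the syntactic head as the last token, then lemmatize."""
    head = mention.lower().split()[-1]
    return LEMMAS.get(head, head)


def same_head_lemma_baseline(mentions):
    """Cluster (doc_id, mention) pairs by head lemma; each cluster is one
    predicted cross-document coreference chain."""
    clusters = defaultdict(list)
    for doc_id, mention in mentions:
        clusters[head_lemma(mention)].append((doc_id, mention))
    return dict(clusters)


# Invented toy mentions drawn from three hypothetical documents.
mentions = [
    ("doc1", "the deadly attacks"),
    ("doc2", "attacked"),
    ("doc2", "the bombing"),
    ("doc3", "bombings"),
]
clusters = same_head_lemma_baseline(mentions)
```

Because the baseline relies purely on surface-form overlap, it performs well exactly when lexical diversity is low, which is why the paper uses it as a proxy for a dataset's CDCR complexity.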
Problem

Research questions and friction points this paper is trying to address.

cross-document coreference resolution
dataset unification
annotation inconsistency
event coreference resolution
lexical diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-document coreference resolution
dataset unification
entity and event coreference
lexical diversity
reproducible evaluation