🤖 AI Summary
This work addresses a limitation of existing cross-document coreference resolution (CDCR) datasets: their narrow definition of coreference fails to capture the lexical variation that arises from diverse phrasings and ideological stances in news reporting. To overcome this, the authors treat coreference chains as discourse elements (DEs), which cover both strict identity and near-identity relations. Guided by a unified codebook, they manually reannotate NewsWCL50 and a subset of ECB+ to reflect this broader conceptualization. The resulting datasets encode the lexical diversity and framing differences found in news coverage while keeping the annotation of DEs fine-grained. Evaluation with lexical diversity metrics and a same-head-lemma baseline shows that the two reannotated datasets align closely with each other and fall between the original ECB+ and the original NewsWCL50, offering a balanced, discourse-aware benchmark for CDCR research in the news domain.
📝 Abstract
Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme for the NewsWCL50 dataset that treats coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan" - "asylum seekers" - "those contemplating illegal entry", allowing models to capture lexical diversity and framing variation in media discourse while maintaining the fine-grained annotation of DEs. We reannotate NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets with lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely with each other on these measures, falling between the original ECB+ and the original NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
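
To make the same-head-lemma baseline concrete, the following is a minimal sketch of the general idea, not the authors' implementation: mentions from all documents are placed in the same chain whenever their head lemmas match. The `Mention` type, the function name, and the hand-assigned head lemmas are assumptions for illustration (in practice the lemmas would come from a parser), and the mention "a migrant caravan" is a hypothetical example that does not appear in the abstract.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Mention(NamedTuple):
    doc_id: str
    text: str
    head_lemma: str  # assumed precomputed; hand-assigned in this sketch


def same_head_lemma_clusters(mentions: List[Mention]) -> Dict[str, List[Mention]]:
    """Group mentions across documents by their lowercased head lemma."""
    clusters: Dict[str, List[Mention]] = defaultdict(list)
    for mention in mentions:
        clusters[mention.head_lemma.lower()].append(mention)
    return dict(clusters)


# Illustrative mentions; "a migrant caravan" is hypothetical.
mentions = [
    Mention("doc1", "the caravan", "caravan"),
    Mention("doc2", "a migrant caravan", "caravan"),
    Mention("doc2", "asylum seekers", "seeker"),
    Mention("doc3", "those contemplating illegal entry", "those"),
]

clusters = same_head_lemma_clusters(mentions)
for lemma, group in clusters.items():
    print(lemma, [m.text for m in group])
```

In this sketch the two "caravan" mentions are merged, while "asylum seekers" and "those contemplating illegal entry" each remain in their own cluster. A purely lexical baseline of this kind therefore cannot recover the near-identity links that the DE annotation is designed to capture, which is why it is useful as a probe of how lexically diverse a dataset's chains are.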