🤖 AI Summary
Coreference resolution is usually evaluated with aggregate metrics such as CoNLL-F1, which can hide model weaknesses on specific semantic categories such as people, locations, or events. This work proposes a semantically enhanced evaluation framework that uses Concept and Named Entity Recognition (CNER) to assign semantic labels to nominal mentions and propagates those labels to entire coreference clusters, enabling scores for mention extraction and linking that are stratified by semantic class. The framework uncovers systematic weaknesses masked by aggregate metrics and informs targeted, low-cost data augmentation strategies. Experiments on OntoNotes, LitBank, and PreCo show that this augmentation yields measurable out-of-domain improvements.
📝 Abstract
Coreference resolution is typically evaluated using aggregate statistical metrics such as CoNLL-F1, which measure structural overlap between predicted and gold clusters. While widely used, these metrics offer limited diagnostic insight: they penalize errors without revealing whether a system struggles with specific semantic categories, such as people, locations, or events. This makes it difficult to interpret model capabilities or derive actionable improvements. We address this gap by introducing a semantically enhanced evaluation framework for coreference resolution. Our approach overlays Concept and Named Entity Recognition (CNER) onto coreference outputs, assigning semantic labels to nominal mentions and propagating them to entire coreference clusters. This enables the computation of typed scores that evaluate mention extraction and linking capabilities stratified by semantic class. Across our experiments on OntoNotes, LitBank, and PreCo, we show that our framework uncovers systematic weaknesses that remain obscured by aggregate metrics. Furthermore, we demonstrate that these diagnostics can be used to design targeted, low-cost data augmentation strategies, achieving measurable out-of-domain improvements.
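To make the label propagation and typed scoring more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation. Mentions are assumed to be `(start, end)` token spans, the CNER output is assumed to be a plain dictionary from span to label, and a cluster is assumed to take the most frequent label among its labeled mentions. The abstract does not specify the propagation rule or the exact score definitions, so the majority vote and the per-type mention precision/recall/F1 below are stand-ins, and the function names `propagate_labels` and `typed_mention_scores` are hypothetical.

```python
from collections import Counter, defaultdict


def propagate_labels(clusters, mention_labels, default="UNK"):
    """Give each cluster the most frequent label among its labeled mentions.

    Assumption: majority vote; the abstract does not state the actual rule.
    Clusters with no labeled mention receive `default`.
    """
    cluster_labels = []
    for cluster in clusters:
        votes = Counter(mention_labels[m] for m in cluster if m in mention_labels)
        cluster_labels.append(votes.most_common(1)[0][0] if votes else default)
    return cluster_labels


def typed_mention_scores(gold_clusters, pred_clusters, mention_labels):
    """Mention-level precision, recall, and F1 for each semantic type.

    Assumption: a simple stand-in for the typed scores in the abstract;
    it covers mention extraction only, not linking.
    """
    gold_by_type = defaultdict(set)
    pred_by_type = defaultdict(set)
    for cluster, label in zip(gold_clusters, propagate_labels(gold_clusters, mention_labels)):
        gold_by_type[label].update(cluster)
    for cluster, label in zip(pred_clusters, propagate_labels(pred_clusters, mention_labels)):
        pred_by_type[label].update(cluster)

    scores = {}
    for label in sorted(set(gold_by_type) | set(pred_by_type)):
        gold, pred = gold_by_type[label], pred_by_type[label]
        tp = len(gold & pred)  # mentions assigned to this type in both gold and prediction
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy document: two entities; only the nominal mentions carry CNER labels,
# and the pronoun-like spans (5, 5), (9, 9), (12, 12) inherit a cluster label.
gold = [[(0, 1), (5, 5), (9, 9)], [(3, 4), (12, 12)]]
pred = [[(0, 1), (5, 5)], [(3, 4), (12, 12), (9, 9)]]  # (9, 9) linked to the wrong entity
labels = {(0, 1): "PER", (3, 4): "LOC"}

for semantic_type, result in typed_mention_scores(gold, pred, labels).items():
    print(semantic_type, {name: round(value, 2) for name, value in result.items()})
```

In this toy case the single misplaced mention lowers recall for the person cluster and precision for the location cluster, which is the kind of per-category signal an aggregate score would blend together.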