Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the robustness of different approaches to cross-document coreference resolution for scientific software mentions under mention-level noise. We systematically evaluate fine-tuning-free Fuzzy Matching (FM) against Context-Aware Representations (CAR) using controlled noise-injection experiments in the SOMD 2026 shared task, analyzing their performance degradation patterns and inference efficiency under boundary perturbations and mention substitutions. Our work reveals, for the first time, complementary failure characteristics between FM and CAR, and quantifies their computational complexity differences: CAR achieves an F1 score of 0.95–0.96 on the official test set, outperforming FM, and demonstrates greater robustness to boundary noise (F1 drops by only 0.07). Moreover, CAR exhibits near-linear inference complexity, making it more suitable for large-scale deployment.
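The paper itself does not ship code with this summary; as a minimal sketch of what the two noise conditions described above (boundary perturbation and mention substitution) might look like, assuming our own function names and noise rates:

```python
import random

def boundary_perturb(mention: str, rng: random.Random) -> str:
    """Illustrative boundary noise: shrink or over-extend the mention span
    by one character, mimicking an imperfect upstream mention detector."""
    if len(mention) > 1 and rng.random() < 0.5:
        return mention[:-1]   # truncated span
    return mention + "s"      # spurious trailing character

def mention_substitute(mention: str, vocabulary: list[str],
                       rng: random.Random) -> str:
    """Illustrative substitution noise: replace the mention with a
    different surface form drawn from the corpus vocabulary."""
    candidates = [m for m in vocabulary if m != mention]
    return rng.choice(candidates) if candidates else mention

rng = random.Random(0)
mentions = ["AstroPy", "NumPy", "SciPy"]
noisy = [boundary_perturb(m, rng) for m in mentions]
```

Under the paper's findings, a lexical matcher is hit hardest by the first kind of corruption, while an embedding-based matcher suffers more under the second.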
📝 Abstract
We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context-Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94–0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
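To make the scaling claim concrete: a lexical FM baseline like the one the abstract describes compares mention strings pairwise, which is quadratic in the number of mentions, whereas an embedding approach can encode each mention once. The sketch below is our own illustration, not the authors' released code; the clustering scheme, similarity measure, and threshold are all assumptions:

```python
from difflib import SequenceMatcher
from itertools import combinations

def fuzzy_cluster(mentions: list[str], threshold: float = 0.85) -> list[set[str]]:
    """Greedy single-link clustering by surface string similarity.

    The pairwise loop over combinations(mentions, 2) is O(n^2),
    matching the superlinear inference scaling reported for FM.
    """
    parent = {m: m for m in mentions}

    def find(x: str) -> str:
        # Union-find with path compression to track cluster membership.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(mentions, 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for m in mentions:
        clusters.setdefault(find(m), set()).add(m)
    return list(clusters.values())

print(fuzzy_cluster(["AstroPy", "astropy", "TensorFlow", "Tensorflow 2"]))
```

Case-insensitive string similarity merges `AstroPy`/`astropy` and `TensorFlow`/`Tensorflow 2` into two clusters; this is exactly the regime where the abstract's point applies, since software names are surface-regular enough that such a lexical baseline stays within about a point of the embedding-based system on clean input.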
Problem

Research questions and friction points this paper is trying to address.

coreference resolution
mention noise
software mentions
lexical methods
contextual representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

coreference resolution
mention noise
context-aware representations
fuzzy matching
scalability
Authors
Atilla Kaan Alkan — Harvard-Smithsonian Center for Astrophysics, Cambridge, MA, USA
Felix Grezes — NASA/ADS (AI, Neural network architectures)
Jennifer Lynn Bartlett — Harvard-Smithsonian Center for Astrophysics, Cambridge, MA, USA
Anna Kelbert — Harvard-Smithsonian Center for Astrophysics, Science Explorer (SciX) (Open Science, Geophysics, Magnetotellurics, Space Weather, Numerical Modeling and Inversion)
Kelly Lockhart — Harvard-Smithsonian Center for Astrophysics, Cambridge, MA, USA
Alberto Accomazzi — Director, NASA Astrophysics Data System, Smithsonian Astrophysical Observatory (Astronomy, Information Science)