AI Summary
This work addresses the inconsistency of software mentions across documents in scientific literature by proposing a normalization framework that integrates semantic embeddings, knowledge base retrieval, and density-based clustering. The approach uses Sentence-BERT for semantic representation and FAISS for efficient retrieval over cluster centroids, and applies surface-form normalization and abbreviation resolution to handle out-of-vocabulary and ambiguous mentions. For large-scale settings, an entity-type-aware blocking strategy improves scalability. Mentions that cannot be confidently assigned to an existing centroid are grouped with HDBSCAN clustering. The system achieves CoNLL F1 scores of 0.98, 0.98, and 0.96 on the three subtasks of the SOMD 2026 shared task, substantially outperforming baseline methods.
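The centroid-based retrieval step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: toy 3-dimensional vectors stand in for Sentence-BERT embeddings, a brute-force cosine search stands in for the FAISS index, and the `threshold` value of 0.9 is a hypothetical choice. Mentions that fall below the threshold are returned as leftovers, which in the described system would be passed on to HDBSCAN.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_kb(training_clusters):
    """Map each training cluster id to the centroid of its mention vectors."""
    return {cid: centroid(vecs) for cid, vecs in training_clusters.items()}

def assign(mention_vecs, kb, threshold=0.9):
    """Assign each mention to its nearest centroid if similar enough.

    Returns (assigned, leftover); leftover mentions would go to a
    density-based clusterer such as HDBSCAN. The threshold is an assumption.
    """
    assigned, leftover = {}, []
    for mid, vec in mention_vecs.items():
        best_id, best_sim = None, -1.0
        for cid, cen in kb.items():
            sim = cosine(vec, cen)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_sim >= threshold:
            assigned[mid] = best_id
        else:
            leftover.append(mid)
    return assigned, leftover

# Toy data: stand-ins for embeddings of training-set mentions.
training_clusters = {
    "numpy": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "scipy": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}
kb = build_kb(training_clusters)

mentions = {"m1": [0.95, 0.05, 0.0], "m2": [0.0, 0.0, 1.0]}
assigned, leftover = assign(mentions, kb)
print(assigned, leftover)
```

In this toy run, `m1` lies on the `numpy` centroid and is assigned to it, while `m2` is orthogonal to both centroids and is left for clustering.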
Abstract
This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, a Knowledge Base (KB) lookup strategy built from training-set cluster centroids using FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To handle the large-scale setting of Subtask 3, we adapt the pipeline with a blocking strategy based on entity types and canonicalized surface forms. Our system achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3, respectively.
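As a rough illustration of surface-form normalization and blocking, the sketch below canonicalizes mention strings and groups them by entity type and canonical form, so that later comparisons only happen within a block. The abstract does not specify the normalization rules or the exact block key, so everything here is an assumption: the `normalize` rules, the choice of the first canonical token as part of the key, the mention dictionary layout, and the type labels are all hypothetical.

```python
import re
from collections import defaultdict

def normalize(surface):
    """Canonicalize a surface form (hypothetical rules, not the paper's).

    Lowercases, drops dotted version strings such as 'v1.7.3', and
    collapses punctuation and whitespace to single spaces.
    """
    s = surface.lower().strip()
    s = re.sub(r"\bv?\d+(\.\d+)+\b", "", s)  # remove dotted version numbers
    s = re.sub(r"[^a-z0-9]+", " ", s)        # punctuation -> space
    return " ".join(s.split())

def block_key(mention):
    """Block on (entity type, first token of the canonical form).

    Using only the first token is an illustrative assumption.
    """
    canon = normalize(mention["text"])
    head = canon.split()[0] if canon else ""
    return (mention["type"], head)

def build_blocks(mentions):
    """Group mention ids by block key; clustering would then run per block."""
    blocks = defaultdict(list)
    for m in mentions:
        blocks[block_key(m)].append(m["id"])
    return dict(blocks)

# Toy mentions; ids and type labels are placeholders.
mentions = [
    {"id": "d1-m1", "text": "Scikit-Learn", "type": "Application"},
    {"id": "d2-m4", "text": "scikit learn 1.2.0", "type": "Application"},
    {"id": "d3-m2", "text": "SciPy v1.7.3", "type": "Application"},
    {"id": "d4-m9", "text": "scipy", "type": "PlugIn"},
]
blocks = build_blocks(mentions)
for key, ids in blocks.items():
    print(key, ids)
```

Blocking trades a small risk of separating true coreferent mentions for a large reduction in the number of pairwise comparisons, which is why it suits the large-scale setting of Subtask 3.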