What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Benchmarks for biomedical named entity recognition (NER) and entity linking (EL) are hard to interpret and generalize from because the properties of the underlying corpora are rarely characterized systematically. This work proposes a corpus-centric diagnostic framework with a standardized set of statistics in five families: scale, density, and label distribution; lexical and conceptual structure; train–test overlap; metadata composition; and terminology coverage. The framework computes these statistics from annotations, concept links, data splits, document metadata, and terminology mappings, and exposes them through an interactive dashboard. Applied to nine corpora covering diseases, chemicals, and cell types, it shows that datasets addressing the same apparent task can differ substantially in the evaluation signal they provide, the generalization they demand, and the regions of literature and concept space they cover. The authors release the framework as open-source code to support benchmark interpretation and the identification of transfer risks.

📝 Abstract

Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.
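
To make two of the statistic families concrete, the sketch below computes a simple train-test overlap measure and a label distribution from a list of annotations. It is an illustration only, not the released framework: the `Annotation` schema, the function names, the lowercasing of mentions, and the placeholder concept identifiers are all assumptions introduced here, and the paper's actual definitions of these statistics may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotated mention (hypothetical schema, not the paper's)."""
    doc_id: str
    mention: str      # surface string as annotated
    concept_id: str   # linked terminology identifier (placeholder values below)
    label: str        # entity type, e.g. "Disease"


def overlap_stats(train, test):
    """Fraction of test annotations whose mention or concept also occurs in train.

    Mentions are compared case-insensitively (an assumption of this sketch).
    """
    train_mentions = {a.mention.lower() for a in train}
    train_concepts = {a.concept_id for a in train}
    n = len(test)
    if n == 0:
        return {"mention_overlap": 0.0, "concept_overlap": 0.0}
    seen_mention = sum(a.mention.lower() in train_mentions for a in test)
    seen_concept = sum(a.concept_id in train_concepts for a in test)
    return {
        "mention_overlap": seen_mention / n,
        "concept_overlap": seen_concept / n,
    }


def label_distribution(annotations):
    """Relative frequency of each entity label."""
    counts = Counter(a.label for a in annotations)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}


# Toy data with made-up concept identifiers.
train = [
    Annotation("d1", "breast cancer", "CONCEPT:1", "Disease"),
    Annotation("d1", "aspirin", "CONCEPT:2", "Chemical"),
]
test = [
    Annotation("d2", "Breast Cancer", "CONCEPT:1", "Disease"),      # seen mention and concept
    Annotation("d2", "mammary carcinoma", "CONCEPT:1", "Disease"),  # new mention, seen concept
    Annotation("d3", "ibuprofen", "CONCEPT:3", "Chemical"),         # unseen
    Annotation("d3", "asthma", "CONCEPT:4", "Disease"),             # unseen
]

stats = overlap_stats(train, test)
dist = label_distribution(test)
print(stats)  # {'mention_overlap': 0.25, 'concept_overlap': 0.5}
print(dist)   # {'Disease': 0.75, 'Chemical': 0.25}
```

In this toy split, half of the test annotations link to concepts already seen in training, but only a quarter reuse a training surface form. A gap of this kind is one way a corpus could reward memorization for linking while still demanding lexical generalization for recognition.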

Problem

Research questions and friction points this paper is trying to address.

biomedical NER

entity linking

benchmarking

corpus diagnostics

annotation corpora

Innovation

Methods, ideas, or system contributions that make the work stand out.

corpus-centric diagnostics

biomedical NER

entity linking