🤖 AI Summary
Accurate evaluation of large-scale biomedical named entity linking (NEL) systems is hindered by the high cost of expert annotation. This work proposes an efficient sampling-based evaluation framework that formulates accuracy estimation as an optimization problem subject to a margin-of-error constraint, thereby minimizing annotation effort. The approach adapts stratified two-stage cluster sampling (STWCS) to the NEL task by introducing annotation-agnostic label stratification and global surface-form clustering strategies, enabling generalizable and statistically reliable estimates. Evaluated on the GutBrainIE corpus, the method achieves a margin of error ≤ 0.05 (an estimated accuracy of 0.915 ± 0.0473) while annotating only 2,749 of the triples (24.6% of the total), reducing expert annotation time by approximately 29% compared to simple random sampling.
📝 Abstract
Named Entity Linking (NEL) is a core component of biomedical Information Extraction (IE) pipelines, yet assessing its quality at scale is challenging due to the high cost of expert annotations and the large size of corpora. In this paper, we present a sampling-based framework to estimate the NEL accuracy of large-scale IE corpora under statistical guarantees and constrained annotation budgets. We frame NEL accuracy estimation as a constrained optimization problem, where the objective is to minimize expected annotation cost subject to a target Margin of Error (MoE) for the corpus-level accuracy estimate. Building on recent work on knowledge graph accuracy estimation, we adapt Stratified Two-Stage Cluster Sampling (STWCS) to the NEL setting, defining label-based strata and global surface-form clusters in a way that is independent of NEL annotations. Applied to 11,184 NEL annotations in GutBrainIE -- a new biomedical corpus openly released in fall 2025 -- our framework reaches an MoE $\leq 0.05$ by manually annotating only 2,749 triples (24.6%), leading to an overall accuracy estimate of $0.915 \pm 0.0473$. A time-based cost model and simulations against a Simple Random Sampling (SRS) baseline show that our design reduces expert annotation time by about 29% at fixed sample size. The framework is generic and can be applied to other NEL benchmarks and IE pipelines that require scalable and statistically robust accuracy assessment.
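To make the estimator concrete: under stratified two-stage cluster sampling, per-stratum accuracy is estimated from the sampled surface-form clusters and then combined with stratum weights, with the MoE derived from the between-cluster variance. The sketch below is a simplified illustration only, not the paper's exact estimator: it assumes equal-probability cluster draws within each stratum and a normal-approximation MoE at ~95% confidence; the function names and toy data are hypothetical.

```python
import math
from statistics import mean

def stratum_estimate(clusters):
    """Accuracy estimate and variance for one stratum under two-stage
    cluster sampling. `clusters` is a list of lists of 0/1 expert
    judgments (1 = triple linked correctly), one inner list per sampled
    surface-form cluster. Simplification: clusters are treated as
    equal-probability draws, so the estimator is the mean of
    cluster-level accuracies with between-cluster variance."""
    m = len(clusters)
    cluster_means = [mean(c) for c in clusters]
    p_hat = mean(cluster_means)
    if m < 2:
        return p_hat, 0.0
    var = sum((y - p_hat) ** 2 for y in cluster_means) / (m * (m - 1))
    return p_hat, var

def stratified_estimate(strata, weights, z=1.96):
    """Combine per-stratum estimates using population weights W_h
    (each stratum's fraction of all triples) into a corpus-level
    accuracy and its margin of error (z = 1.96 for ~95% confidence)."""
    parts = [stratum_estimate(clusters) for clusters in strata]
    acc = sum(w * p for w, (p, _) in zip(weights, parts))
    var = sum(w ** 2 * v for w, (_, v) in zip(weights, parts))
    return acc, z * math.sqrt(var)

# Toy example: two label-based strata with a few annotated clusters each.
strata = [
    [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1]],  # stratum A
    [[1, 0], [1, 1], [0, 1], [1, 1]],            # stratum B
]
weights = [0.6, 0.4]  # hypothetical stratum proportions
acc, moe = stratified_estimate(strata, weights)
print(f"accuracy = {acc:.3f} +/- {moe:.3f}")
```

In the paper's actual budget-constrained setting, one would keep sampling clusters per stratum until the computed `moe` falls below the target (e.g. 0.05), which is what keeps the annotation count low relative to annotating the full corpus.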