Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data

📅 2024-12-13
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot named entity recognition (NER) methods rely on synthetic training data whose labels overlap semantically with those of standard evaluation sets, leading to severe overestimation of true zero-shot performance; moreover, current setups offer no quantitative assessment of label shift. Method: The paper proposes a label-shift-aware evaluation paradigm for zero-shot NER, introducing the Familiarity metric, which jointly models label semantic similarity (via PLM embeddings) and training label frequency, explicitly incorporating label shift into evaluation. The framework enables controllable-difficulty, fine-grained, and reproducible benchmarking. Contribution/Results: Experiments reveal that mainstream methods suffer systematic F1 overestimation of 10–30 points under label familiarity bias. The work establishes a more realistic, comparable, and interpretable evaluation benchmark for zero-shot NER, advancing rigorous assessment beyond superficial semantic overlap.

📝 Abstract
Zero-shot named entity recognition (NER) is the task of detecting named entities of specific types (such as 'Person' or 'Medicine') without any training examples. Current research increasingly relies on large synthetic datasets, automatically generated to cover tens of thousands of distinct entity types, to train zero-shot NER models. However, in this paper, we find that these synthetic datasets often contain entity types that are semantically highly similar to (or even the same as) those in standard evaluation benchmarks. Because of this overlap, we argue that reported F1 scores for zero-shot NER overestimate the true capabilities of these approaches. Further, we argue that current evaluation setups provide an incomplete picture of zero-shot abilities since they do not quantify the label shift (i.e., the similarity of labels) between training and evaluation datasets. To address these issues, we propose Familiarity, a novel metric that captures both the semantic similarity between entity types in training and evaluation, as well as their frequency in the training data, to provide an estimate of label shift. It allows researchers to contextualize reported zero-shot NER scores when using custom synthetic training datasets. Further, it enables researchers to generate evaluation setups of various transfer difficulties for fine-grained analysis of zero-shot NER.
Problem

Research questions and friction points this paper is trying to address.

Evaluating label shift in synthetic NER training data
Quantifying semantic similarity between training and evaluation entities
Providing contextualized zero-shot NER performance assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Familiarity metric for label shift
Quantifies semantic similarity of entity types
Measures training data frequency for evaluation
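The abstract describes Familiarity as combining the semantic similarity between training and evaluation entity types (via PLM embeddings) with label frequency in the training data. A minimal sketch of how such a metric might be computed is shown below; the function name, the frequency-weighted averaging scheme, and the toy embedding vectors are illustrative assumptions, not the paper's exact formula:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def familiarity(eval_embs, train_embs, train_freqs):
    """Hypothetical sketch of a Familiarity-style score.

    For each evaluation label embedding, compute its similarity to every
    training label embedding, weight those similarities by the relative
    frequency of each training label, and average over evaluation labels.
    A score near 1 means evaluation labels are semantically close to
    frequent training labels (low label shift); a score near 0 means
    they are dissimilar (a harder, more truly zero-shot setup).
    """
    weights = np.asarray(train_freqs, dtype=float)
    weights = weights / weights.sum()  # relative label frequencies
    scores = []
    for e in eval_embs:
        sims = np.array([cosine(e, t) for t in train_embs])
        scores.append(float(np.dot(weights, sims)))
    return float(np.mean(scores))

# Toy example with hand-crafted 2-D "embeddings" (real usage would
# embed label names with a pretrained language model):
train_embs = [[1.0, 0.0], [0.0, 1.0]]   # two training entity types
eval_embs = [[1.0, 0.0]]                # eval label identical to the first
print(familiarity(eval_embs, train_embs, train_freqs=[3, 1]))  # → 0.75
```

In this toy setup the evaluation label matches a training label that accounts for 75% of the training mentions, so the score is 0.75; down-weighting that training label would lower the score, mirroring the paper's point that both similarity and frequency matter.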