🤖 AI Summary
Existing zero-shot named entity recognition (NER) methods rely on synthetic training data whose entity types overlap semantically with those in evaluation sets, leading to severe overestimation of true zero-shot performance; current evaluations also lack any quantitative assessment of label shift. Method: We propose the first label-shift-aware evaluation paradigm for zero-shot NER, introducing the Familiarity metric, which jointly models label semantic similarity (via PLM embeddings) and training label frequency, explicitly incorporating label shift into evaluation. Our framework enables controllable-difficulty, fine-grained, and reproducible benchmarking. Contribution/Results: Experiments reveal that mainstream methods suffer systematic F1 overestimation of 10–30 points under label familiarity bias. Our work establishes a more realistic, comparable, and interpretable evaluation benchmark for zero-shot NER, advancing rigorous assessment beyond superficial semantic overlap.
📝 Abstract
Zero-shot named entity recognition (NER) is the task of detecting named entities of specific types (such as 'Person' or 'Medicine') without any training examples. Current research increasingly relies on large synthetic datasets, automatically generated to cover tens of thousands of distinct entity types, to train zero-shot NER models. However, in this paper, we find that these synthetic datasets often contain entity types that are semantically highly similar to (or even the same as) those in standard evaluation benchmarks. Because of this overlap, we argue that reported F1 scores for zero-shot NER overestimate the true capabilities of these approaches. Further, we argue that current evaluation setups provide an incomplete picture of zero-shot abilities since they do not quantify the label shift (i.e., the similarity of labels) between training and evaluation datasets. To address these issues, we propose Familiarity, a novel metric that captures both the semantic similarity between entity types in training and evaluation and their frequency in the training data, providing an estimate of label shift. It allows researchers to contextualize reported zero-shot NER scores when using custom synthetic training datasets. Further, it enables researchers to generate evaluation setups of varying transfer difficulty for fine-grained analysis of zero-shot NER.
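To make the idea concrete, here is a minimal sketch of a Familiarity-style score. It is an illustrative formulation, not the paper's exact definition: for each evaluation label it averages cosine similarities to the training labels, weighting each training label by its relative frequency, and then averages over evaluation labels. The `embed` lookup table of toy vectors stands in for PLM label embeddings, and the function name `familiarity` is our own.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def familiarity(eval_labels, train_label_counts, embed):
    """Illustrative label-shift score: frequency-weighted similarity of
    evaluation labels to training labels (higher = easier transfer)."""
    total = sum(train_label_counts.values())
    scores = []
    for ev in eval_labels:
        # Weight each training label's similarity by its relative frequency,
        # so rare-but-similar training labels contribute less.
        score = sum(
            (count / total) * cosine(embed[ev], embed[tr])
            for tr, count in train_label_counts.items()
        )
        scores.append(score)
    return sum(scores) / len(scores)

# Toy two-dimensional embeddings standing in for PLM label embeddings.
embed = {
    "person":   [1.0, 0.0],
    "human":    [0.9, 0.1],
    "medicine": [0.0, 1.0],
}
train_counts = Counter({"human": 90, "medicine": 10})
print(round(familiarity(["person"], train_counts, embed), 3))  # → 0.894
```

Under this toy setup, evaluating on 'Person' after training mostly on 'Human' yields a high score, flagging that the "zero-shot" evaluation is in fact near-familiar; a low score would indicate genuine label shift.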