🤖 AI Summary
Ultra-fine-grained entity set expansion (Ultra-ESE) suffers from semantic ambiguity and cannot express "undesired" semantics when it relies solely on positive seed entities. Method: This paper introduces negative seed entities for the first time to explicitly disambiguate highly similar semantic classes. It constructs UltraWiki, the first large-scale dataset tailored for Ultra-ESE, and proposes two frameworks for evaluating large language models: RetExpan (retrieval-based) and GenExpan (generation-based), enhanced with contrastive learning, retrieval augmentation, and chain-of-thought reasoning. Results: Experiments on UltraWiki's 236 ultra-fine-grained semantic classes confirm the effectiveness of these strategies, while also showing that current large language models leave substantial room for improvement on Ultra-ESE.
📝 Abstract
Entity Set Expansion (ESE) aims to identify new entities belonging to the same semantic class as a given set of seed entities. Traditional methods rely primarily on positive seed entities to represent the target semantic class, which makes it challenging to represent ultra-fine-grained semantic classes. An ultra-fine-grained semantic class is a fine-grained semantic class with additional, more specific attribute constraints. Describing such a class with positive seed entities alone causes two issues: (i) ambiguity among ultra-fine-grained semantic classes, and (ii) the inability to define "unwanted" semantics. Because of these inherent shortcomings, previous methods struggle with ultra-fine-grained ESE (Ultra-ESE). To solve this issue, we first introduce negative seed entities into the input; these belong to the same fine-grained semantic class as the positive seed entities but differ in certain attributes. Negative seed entities eliminate semantic ambiguity through the contrast between positive and negative attributes, and they provide a straightforward way to express "unwanted" semantics. To assess model performance in Ultra-ESE, we construct UltraWiki, the first large-scale dataset tailored for Ultra-ESE. UltraWiki encompasses 236 ultra-fine-grained semantic classes, and each query is represented with 3-5 positive and negative seed entities. We propose a retrieval-based framework, RetExpan, and a generation-based framework, GenExpan, to comprehensively assess the efficacy of large language models in Ultra-ESE from two different paradigms. Moreover, we devise three strategies to enhance models' comprehension of ultra-fine-grained entity semantics: contrastive learning, retrieval augmentation, and chain-of-thought reasoning. Extensive experiments confirm the effectiveness of the proposed strategies and also reveal that there remains considerable room for improvement in Ultra-ESE.
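To make the role of negative seeds concrete, here is a minimal toy sketch of how a contrast between positive and negative seeds can re-rank candidates. It is not the paper's RetExpan or GenExpan: the entity names, the hand-written attribute vectors, the `expand` function, and the `neg_weight` parameter are all hypothetical and chosen only for illustration.

```python
from math import sqrt

# Hypothetical attribute vectors: [brand_x, brand_y, has_5g].
# Target class (assumed for this example): "brand X phones with 5G".
EMB = {
    "pos_1": [1, 0, 1],
    "pos_2": [1, 0, 1],
    "neg_1": [1, 0, 0],      # same fine-grained class (brand X phone), but no 5G
    "neg_2": [1, 0, 0],
    "cand_x_5g": [1, 0, 1],  # should be expanded
    "cand_x_4g": [1, 0, 0],  # shares the "unwanted" attribute value
    "cand_y_5g": [0, 1, 1],  # different brand
}
POS = ["pos_1", "pos_2"]
NEG = ["neg_1", "neg_2"]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_similarity(vec, seeds, emb):
    """Average cosine similarity between vec and each seed's vector."""
    return sum(cosine(vec, emb[s]) for s in seeds) / len(seeds)


def expand(emb, pos, neg, k=3, neg_weight=0.5):
    """Rank non-seed entities by similarity to positive seeds,
    penalized by similarity to negative seeds (illustrative scoring only)."""
    seeds = set(pos) | set(neg)
    scores = {
        name: mean_similarity(vec, pos, emb) - neg_weight * mean_similarity(vec, neg, emb)
        for name, vec in emb.items()
        if name not in seeds
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    print("positive seeds only:", expand(EMB, POS, NEG, neg_weight=0.0))
    print("with negative seeds:", expand(EMB, POS, NEG, neg_weight=0.5))
```

In this toy setup, the candidate that shares the negative seeds' attribute value (`cand_x_4g`) ranks second when only positive seeds are used and drops to last once the negative-seed penalty is applied. The penalty weight matters: a large `neg_weight` would also penalize the brand attribute that positive and negative seeds share, which is one reason a simple subtraction like this is only a sketch of the idea.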