Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Open Named Entity Recognition (Open NER), identifying arbitrary entity types across arbitrary domains, remains difficult for LLMs, in part because entity category definitions are inconsistent across existing resources. The authors present B2NERD, a compact training dataset refined from 54 English and Chinese NER datasets in two steps. First, inconsistent entity definitions are detected and reconciled under a universal taxonomy of 400+ entity types with distinguishable label names. Second, redundancy is reduced by a data pruning strategy that selects fewer samples with greater category and semantic diversity. B2NER models trained on B2NERD outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods on 3 out-of-domain benchmarks spanning 15 datasets and 6 languages, advancing zero-shot Open NER performance.

📝 Abstract
Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets neglects their inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain adaptation. To address this, we present B2NERD, a compact dataset designed to guide LLMs' generalization in Open NER under a universal entity taxonomy. B2NERD is refined from 54 existing English and Chinese datasets using a two-step process. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly enhances LLMs' Open NER capabilities. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages. The data, models, and code are publicly available at https://github.com/UmeanNever/B2NER.
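The abstract's second refinement step, pruning redundant data while preserving category and semantic diversity, can be illustrated with a minimal greedy sketch. This is an assumption-laden stand-in, not the paper's actual algorithm: it scores each candidate by how rare its entity labels are in the selection so far, plus a token-novelty proxy for semantic diversity, and the `diversity_prune` helper and its sample format are hypothetical.

```python
from collections import Counter

def diversity_prune(samples, budget):
    """Greedy diversity-driven pruning (illustrative sketch only).

    `samples` is a list of dicts: {"text": str, "labels": set[str]}.
    At each step, pick the sample whose labels are least represented in
    the current selection, favoring texts with many unseen tokens.
    """
    selected, label_counts, seen_tokens = [], Counter(), set()
    pool = list(samples)
    while pool and len(selected) < budget:
        def score(s):
            # Rarity: labels underrepresented in the selection score higher.
            rarity = sum(1.0 / (1 + label_counts[l]) for l in s["labels"])
            # Novelty: fraction of tokens not yet covered (diversity proxy).
            toks = set(s["text"].lower().split())
            novelty = len(toks - seen_tokens) / max(len(toks), 1)
            return rarity + novelty
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
        label_counts.update(best["labels"])
        seen_tokens |= set(best["text"].lower().split())
    return selected
```

Under this scheme a near-duplicate sample covering an already-selected category scores low on both terms, so a small budget still spans many entity types.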
Problem

Research questions and friction points this paper is trying to address.

Open Named Entity Recognition
Generalization Ability
Entity Category Consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

B2NERD
Cross-lingual Entity Recognition
Unified Entity Classification
👥 Authors
Yuming Yang · Fudan University (Natural Language Processing, Large Language Models)
Wantong Zhao · School of Computer Science, Fudan University, China
Caishuang Huang · Fudan University (LLM, RLHF, Tool Learning)
Junjie Ye · School of Computer Science, Fudan University, China
Xiao Wang · School of Computer Science, Fudan University, China
Huiyuan Zheng · School of Computer Science, Fudan University, China
Yang Nan · School of Computer Science, Fudan University, China
Yuran Wang · Honor Device Co., Ltd
Xueying Xu · Honor Device Co., Ltd
Kaixin Huang · Honor Device Co., Ltd
Yunke Zhang · Honor Device Co., Ltd
Tao Gui · Institute of Modern Languages and Linguistics, Fudan University, China
Qi Zhang · School of Computer Science, Fudan University, China
Xuanjing Huang · School of Computer Science, Fudan University, China