Augmenting NER Datasets with LLMs: Towards Automated and Refined Annotation

📅 2024-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high annotation costs, label noise, and severe class imbalance in Named Entity Recognition (NER) datasets, this paper proposes a human-in-the-loop hybrid annotation paradigm. Methodologically, it introduces a dynamic label-mixing sampling strategy to mitigate the class bias inherent in large language model (LLM)-generated annotations, and integrates human-curated fine-grained labels with LLM-produced coarse labels in an iterative optimization framework combining prompt engineering, human feedback loops, and multi-stage noise filtering. The contributions are threefold: (1) significantly improved annotation quality and efficiency at controllable annotation cost; (2) F1-score gains of 3.2–5.8 percentage points for NER models trained on the resulting annotations, relative to conventionally annotated data, across multiple benchmark datasets; and (3) a demonstration of the paradigm’s effectiveness and scalability under tight budget constraints.
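The human-in-the-loop round described above (LLM proposes spans, low-confidence spans are filtered, and a budget-limited subset goes to human review) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `llm_annotate`, `human_review`, the confidence threshold, and the review budget are all hypothetical names and parameters.

```python
def annotation_round(sentences, llm_annotate, human_review,
                     conf_threshold=0.8, review_budget=2):
    """One hypothetical human-in-the-loop annotation round.

    llm_annotate(sent) -> [(start, end, label, confidence), ...]
    human_review(sent, span) -> bool (accept/reject the span)
    """
    accepted, queued = [], []
    for sent in sentences:
        spans = llm_annotate(sent)
        # Multi-stage noise filtering, reduced here to one confidence cut.
        keep = [s for s in spans if s[3] >= conf_threshold]
        drop = [s for s in spans if s[3] < conf_threshold]
        accepted.append((sent, keep))
        queued.extend((sent, s) for s in drop)
    # Spend the limited human budget on the least confident spans first.
    queued.sort(key=lambda item: item[1][3])
    for sent, span in queued[:review_budget]:
        if human_review(sent, span):
            for s, keep in accepted:
                if s == sent:
                    keep.append(span)
    return accepted
```

In a real pipeline the filtering would be iterated with prompt refinements and the review decisions fed back into the next round.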

📝 Abstract
In the field of Natural Language Processing (NLP), Named Entity Recognition (NER) is recognized as a critical technology, employed across a wide array of applications. Traditional methodologies for annotating datasets for NER models are challenged by high costs and variations in dataset quality. This research introduces a novel hybrid annotation approach that synergizes human effort with the capabilities of Large Language Models (LLMs). This approach not only aims to ameliorate the noise inherent in manual annotations, such as omissions, thereby enhancing the performance of NER models, but also achieves this in a cost-effective manner. Additionally, by employing a label mixing strategy, it addresses the issue of class imbalance encountered in LLM-based annotations. Through an analysis across multiple datasets, this method has been consistently shown to provide superior performance compared to traditional annotation methods, even under constrained budget conditions. This study illuminates the potential of leveraging LLMs to improve dataset quality, introduces a novel technique to mitigate class imbalances, and demonstrates the feasibility of achieving high-performance NER in a cost-effective way.
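The label-mixing strategy mentioned in the abstract can be illustrated with a small sketch: prefer human-curated spans, then top up under-represented entity classes with LLM-generated spans until each class reaches a target count. The function name, span format, and per-class target are assumptions for illustration, not the paper's actual algorithm.

```python
import random
from collections import Counter

def mix_labels(human_spans, llm_spans, target_per_class, seed=0):
    """Hypothetical label-mixing sketch for class-balanced annotation.

    Spans are (start, end, label) tuples. Human spans are always kept;
    LLM spans only fill the gap up to target_per_class per label.
    """
    rng = random.Random(seed)
    mixed = list(human_spans)
    counts = Counter(label for _, _, label in mixed)
    # Pool LLM spans by class, shuffled so the top-up is unbiased.
    pool = {}
    for span in llm_spans:
        pool.setdefault(span[2], []).append(span)
    for label, spans in pool.items():
        rng.shuffle(spans)
        need = target_per_class - counts.get(label, 0)
        mixed.extend(spans[:max(0, need)])
    return mixed
```

Capping LLM contributions per class is one simple way to keep frequent classes from dominating while still letting cheap LLM labels cover classes that human annotation missed.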
Problem

Research questions and friction points this paper is trying to address.

Named Entity Recognition
Data Annotation
Cost-effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Ensemble Labeling
Cost-effective Learning
👥 Authors
Yuji Naraki (Independent Researcher)
Ryosuke Yamaki (Ritsumeikan University / ProPlace Inc)
Yoshikazu Ikeda (Osaka University / ProPlace Inc)
Takafumi Horie (Ritsumeikan University)
Kotaro Yoshida (Science Tokyo)
Ryotaro Shimizu (ZOZO Research)
Hiroki Naganuma (Mila, Université de Montréal / ProPlace Inc)