🤖 AI Summary
To address the high annotation cost, label noise, and severe class imbalance of Named Entity Recognition (NER) datasets, this paper proposes a human-in-the-loop hybrid annotation paradigm. Methodologically, it introduces a dynamic label-mixing sampling strategy that mitigates the class bias inherent in annotations generated by large language models (LLMs), and integrates human-curated fine-grained labels with LLM-produced coarse labels in an iterative optimization framework combining prompt engineering, human feedback loops, and multi-stage noise filtering. The contributions are threefold: (1) substantially improved annotation quality and efficiency at a controllable annotation cost; (2) F1-score gains of 3.2–5.8 percentage points for NER models trained on the resulting data, relative to models trained with conventional annotation methods, across multiple benchmark datasets; and (3) a demonstration that the paradigm remains effective and scalable under tight budget constraints.
📝 Abstract
In Natural Language Processing (NLP), Named Entity Recognition (NER) is a critical technology employed across a wide range of applications. Traditional approaches to annotating NER datasets suffer from high cost and inconsistent quality. This research introduces a hybrid annotation approach that combines human effort with the capabilities of Large Language Models (LLMs). The approach reduces the noise inherent in manual annotation, such as omitted entities, thereby improving NER model performance, while remaining cost-effective. In addition, a label-mixing strategy addresses the class imbalance observed in LLM-based annotations. Across multiple datasets, the method consistently outperforms traditional annotation approaches, even under constrained budgets. This study highlights the potential of LLMs to improve dataset quality, introduces a technique to mitigate class imbalance, and demonstrates that high-performance NER can be achieved in a cost-effective way.
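The label-mixing idea described above could be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it resamples LLM-annotated sentences with weights inversely proportional to the frequency of their entity classes (countering class bias), then blends in a fixed fraction of human-curated examples. All function and parameter names here are hypothetical.

```python
import random
from collections import Counter

def mix_annotations(human_examples, llm_examples, target_ratio=0.3, seed=0):
    """Sketch of a label-mixing sampler (illustrative, not the paper's method).

    Each example is (tokens, labels) with BIO-style labels; `target_ratio`
    is the number of human examples blended in per LLM example.
    """
    rng = random.Random(seed)

    # Entity-class frequencies in the LLM pool (the source of class bias).
    counts = Counter(
        lab.split("-", 1)[1]
        for _, labels in llm_examples
        for lab in labels
        if lab != "O"
    )
    total = sum(counts.values())

    def rarity(example):
        # Weight a sentence by its rarest entity class; all-O sentences get 1.
        classes = {lab.split("-", 1)[1] for lab in example[1] if lab != "O"}
        return max((total / counts[c] for c in classes), default=1.0)

    # Rarity-weighted resampling of the LLM pool mitigates class imbalance.
    weights = [rarity(ex) for ex in llm_examples]
    mixed = rng.choices(llm_examples, weights=weights, k=len(llm_examples))

    # Blend in human-curated examples at the requested ratio (cycled if few).
    n_human = int(target_ratio * len(llm_examples))
    mixed += [human_examples[i % len(human_examples)] for i in range(n_human)]
    rng.shuffle(mixed)
    return mixed
```

In this toy setup, sentences containing classes that the LLM rarely produced are oversampled, which is one simple way to realize the "dynamic label-mixing" goal of rebalancing classes without additional human labeling.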