AI Summary
This work addresses the high cost and low efficiency of span annotation in text, which traditionally relies on manual effort or fine-tuning encoder-based models. We propose a zero-shot/few-shot direct annotation paradigm leveraging large language models (LLMs), eliminating the need for model fine-tuning. Our method employs structured prompting combined with chain-of-thought (CoT) reasoning to elicit fine-grained, explanation-augmented span annotations from both open-source (e.g., Llama) and closed-source (e.g., GPT-series) LLMs. Key contributions include: (1) the first systematic empirical validation that LLMs can directly perform span annotation at competitive quality; (2) demonstration that reasoning-oriented LLMs achieve annotation quality, interpretability, and inter-annotator consistency (Cohen's κ ≈ 0.4-0.6) comparable to human annotators; and (3) over 90% reduction in annotation cost versus conventional approaches. To support reproducibility and future research, we release a high-quality benchmark dataset comprising over 40,000 annotated spans.
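The structured prompting described above can be sketched as follows. This is a minimal illustration, not the paper's actual prompt: the template wording, the JSON output schema (`start`/`end` character offsets, `category`, `explanation`), and the mock model response are all assumptions made for the example.

```python
import json

# Hypothetical zero-shot prompt template (illustrative only; the paper's
# exact prompt and output schema may differ).
PROMPT_TEMPLATE = """You are an annotation assistant. Identify problematic spans
in the text below. Think step by step, then output a JSON list where each item
has "start" and "end" (character offsets), a "category", and an "explanation".

Text: {text}
"""

def build_prompt(text: str) -> str:
    """Fill the zero-shot span-annotation prompt for a given input text."""
    return PROMPT_TEMPLATE.format(text=text)

def parse_spans(llm_output: str) -> list[dict]:
    """Parse the model's JSON span list, skipping any chain-of-thought
    preamble emitted before the first '[' of the JSON array."""
    start = llm_output.find("[")
    return json.loads(llm_output[start:])

# Example with a mocked model response (no API call is made here):
mock_response = (
    'Reasoning: the word at offsets 4-7 is misspelled.\n'
    '[{"start": 4, "end": 7, "category": "typo", '
    '"explanation": "misspelling of the word the"}]'
)
spans = parse_spans(mock_response)
print(spans[0]["category"], spans[0]["start"], spans[0]["end"])
```

In practice the mock response would be replaced by a call to the chosen LLM API, and the parser would need error handling for malformed JSON, which reasoning models do occasionally produce.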
Abstract
For high-quality texts, single-score metrics seldom provide actionable feedback. In contrast, span annotation - pointing out issues in the text by annotating their spans - can guide improvements and provide insights. Until recently, span annotation was limited to human annotators or fine-tuned encoder models. In this study, we automate span annotation with large language models (LLMs). We compare expert or skilled crowdworker annotators with open and proprietary LLMs on three tasks: data-to-text generation evaluation, machine translation evaluation, and propaganda detection in human-written texts. In our experiments, we show that LLMs as span annotators are straightforward to implement and notably more cost-efficient than human annotators. The LLMs achieve moderate agreement with skilled human annotators, in some scenarios comparable to the average agreement among the annotators themselves. Qualitative analysis shows that reasoning models outperform their instruction-tuned counterparts and provide more valid explanations for annotations. We release the dataset of more than 40k model and human annotations for further research.
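The agreement figures above are reported as Cohen's κ. One common way to compute κ for span annotations, assumed here for illustration (the paper may use a different matching scheme), is to project each annotator's spans onto token-level labels and compare the two label sequences:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa between two annotators' token-level label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: human vs. LLM labels over 8 tokens ("ERR" = inside a span).
human = ["O", "O", "ERR", "ERR", "O", "O", "ERR", "O"]
model = ["O", "O", "ERR", "O",   "O", "O", "ERR", "ERR"]
print(round(cohens_kappa(human, model), 3))  # 0.467 -- "moderate" agreement
```

Values in the 0.4-0.6 band, as in this toy example, are conventionally read as moderate agreement, which is the regime the abstract reports for LLM annotators versus skilled humans.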