🤖 AI Summary
This work systematically evaluates the efficacy and limitations of large language models (LLMs) as annotators for subjective tasks. We survey twelve prior studies and empirically compare opinion-distribution alignment between GPT-series models and human annotators across four subjective datasets, introducing a novel evaluation paradigm centered on *perspective diversity alignment*. Results reveal substantial distributional misalignment in LLMs, including the underrepresentation of minority viewpoints, alongside prompt sensitivity, an English-language bias, and embedded societal prejudices; most existing annotation methods overlook such distributional discrepancies, and only a few strategies effectively capture opinion diversity. Our findings expose critical reliability risks in deploying LLMs for subjective annotation and establish a reproducible statistical framework, comprising quantitative metrics and methodological guidelines, for assessing annotation quality in subjective NLP tasks.
📝 Abstract
Recent studies explore the capability of Large Language Models (LLMs) for data annotation. Our work first offers a comparative overview of twelve such studies that investigate labelling with LLMs, focusing particularly on classification tasks. Second, we present an empirical analysis that examines the degree of alignment between the opinion distributions returned by GPT and those provided by human annotators across four subjective datasets. Our analysis supports the minority of studies that consider diverse perspectives when evaluating data annotation tasks and highlights the need for further research in this direction.
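The abstract does not pin down how "alignment between opinion distributions" is measured, but a minimal sketch of the general idea follows, assuming per-item label distributions are compared with a symmetric divergence such as Jensen-Shannon distance. The label set, the example annotations, and the `label_distribution` helper are all hypothetical illustrations, not the paper's actual metric or data.

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon


def label_distribution(labels, label_set):
    """Normalized frequency of each label in a list of annotations."""
    counts = Counter(labels)
    freqs = np.array([counts[label] for label in label_set], dtype=float)
    return freqs / freqs.sum()


# Hypothetical annotations for one item: a human crowd vs. repeated LLM samples.
label_set = ["offensive", "not_offensive"]
human_labels = ["offensive", "offensive", "not_offensive", "offensive", "not_offensive"]
llm_labels = ["not_offensive"] * 4 + ["offensive"]

p_human = label_distribution(human_labels, label_set)
p_llm = label_distribution(llm_labels, label_set)

# Jensen-Shannon distance: 0 means identical distributions, 1 maximally divergent.
print(f"JS distance: {jensenshannon(p_human, p_llm):.3f}")
```

A distribution-level comparison like this, as opposed to checking agreement with a single majority-vote label, is what lets an evaluation detect when an LLM collapses onto the majority view and underrepresents minority opinions.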