🤖 AI Summary
This work systematically evaluates the efficacy and limitations of large language models (LLMs) as annotators for subjective tasks. We survey twelve prior studies and empirically compare opinion-distribution alignment between GPT-series models and human annotators across four subjective datasets, introducing a novel evaluation paradigm centered on *perspective diversity alignment*. Results reveal substantial distributional misalignment in LLMs, including the underrepresentation of minority viewpoints, alongside prompt sensitivity, an English-language bias, and embedded societal prejudices; most existing annotation methods overlook such distributional discrepancies, and only a few strategies effectively capture opinion diversity. Our findings expose critical reliability risks in deploying LLMs for subjective annotation and establish a reproducible statistical framework, comprising quantitative metrics and methodological guidelines, for assessing annotation quality in subjective NLP tasks.
📝 Abstract
Recent studies explore the capability of Large Language Models (LLMs) for data annotation. Our work first offers a comparative overview of twelve such studies that investigate labelling with LLMs, focusing particularly on classification tasks. Second, we present an empirical analysis that examines the degree of alignment between the opinion distributions returned by GPT and those provided by human annotators across four subjective datasets. Our analysis supports the minority of studies that consider diverse perspectives when evaluating data annotation tasks and highlights the need for further research in this direction.
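The abstract does not pin down how "alignment between opinion distributions" is measured, but a minimal sketch of the general idea follows, assuming per-item label distributions are compared with a symmetric divergence such as Jensen-Shannon distance. The label set, the example annotations, and the `label_distribution` helper are all hypothetical illustrations, not the paper's actual metric or data.

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon


def label_distribution(labels, label_set):
    """Normalized frequency of each label in a list of annotations."""
    counts = Counter(labels)
    freqs = np.array([counts[label] for label in label_set], dtype=float)
    return freqs / freqs.sum()


# Hypothetical annotations for one item: a human crowd vs. repeated LLM samples.
label_set = ["offensive", "not_offensive"]
human_labels = ["offensive", "offensive", "not_offensive", "offensive", "not_offensive"]
llm_labels = ["not_offensive"] * 4 + ["offensive"]

p_human = label_distribution(human_labels, label_set)
p_llm = label_distribution(llm_labels, label_set)

# Jensen-Shannon distance: 0 means identical distributions, 1 maximally divergent.
print(f"JS distance: {jensenshannon(p_human, p_llm):.3f}")
```

A distribution-level comparison like this, as opposed to checking agreement with a single majority-vote label, is what lets an evaluation detect when an LLM collapses onto the majority view and underrepresents minority opinions.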