Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

📅 2026-04-15

🤖 AI Summary
This study asks whether human annotators are still needed in active learning now that large language models (LLMs) can label entire datasets cheaply. Using a newly compiled dataset of 277,902 German-language political TikTok comments, the authors compare seven annotation strategies across four text encoders on the task of detecting anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels (\$43) reaches F1-Macro comparable to one trained on 3,800 human annotations (\$316). Active learning offers little advantage over random sampling in the pre-enriched pool and yields lower F1 than full LLM annotation at the same cost. However, LLM-trained classifiers over-predict the positive class relative to the human gold standard, mostly in topically ambiguous discussions where hostility and policy critique are hard to tell apart, so F1 alone can hide systematic differences in error structure.
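To make the cost comparison concrete, the per-label arithmetic can be sketched as follows. Only the four figures (\$43, 25,974 labels, \$316, 3,800 labels) come from the text; the function and variable names are illustrative assumptions, and the result says nothing about label quality.

```python
# Figures quoted in the summary; everything else here is illustrative.
llm_cost_usd, llm_labels = 43.0, 25_974
human_cost_usd, human_labels = 316.0, 3_800


def cost_per_label(total_cost_usd: float, n_labels: int) -> float:
    """Average annotation cost in USD for a single label."""
    return total_cost_usd / n_labels


llm_unit = cost_per_label(llm_cost_usd, llm_labels)
human_unit = cost_per_label(human_cost_usd, human_labels)
unit_ratio = human_unit / llm_unit            # per-label cost ratio
total_ratio = human_cost_usd / llm_cost_usd   # total budget ratio

print(f"LLM:   ${llm_unit:.4f} per label")
print(f"Human: ${human_unit:.4f} per label")
print(f"Per-label cost ratio (human / LLM): {unit_ratio:.1f}")
print(f"Total cost ratio (human / LLM):     {total_ratio:.1f}")
```

The per-label ratio is much larger than the total-cost ratio because the LLM-labelled training set is roughly seven times the size of the human-labelled one.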

📝 Abstract
Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels (\$43) achieves comparable F1-Macro to one trained on 3,800 human annotations (\$316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.
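The point that comparable F1-Macro can hide different error structures can be illustrated with a small sketch. The confusion counts below are hypothetical, chosen only for illustration, and are not results from the paper. Classifier A makes balanced errors, while classifier B over-predicts the positive class, yet their F1-Macro scores are nearly identical.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(tp: int, fp: int, fn: int, tn: int) -> float:
    """Unweighted mean of the positive-class and negative-class F1 (binary task)."""
    positive = f1(tp, fp, fn)
    # For the negative class the roles swap: tn are its hits, and the
    # positive class's fn/fp become its fp/fn.
    negative = f1(tn, fn, fp)
    return (positive + negative) / 2


# HYPOTHETICAL counts on 1,000 test items with 100 true positives.
balanced = dict(tp=70, fp=30, fn=30, tn=870)         # classifier A
over_predicting = dict(tp=90, fp=65, fn=10, tn=835)  # classifier B

for name, c in [("A (balanced)", balanced), ("B (over-predicts)", over_predicting)]:
    predicted_positive = c["tp"] + c["fp"]
    print(f"{name}: F1-Macro={macro_f1(**c):.3f}, "
          f"predicted positives={predicted_positive}, false positives={c['fp']}")
```

In this toy setting, B flags about 55% more items as positive than A while the aggregate score barely moves, which is why inspecting false-positive and false-negative counts alongside F1-Macro can matter for the target application.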
Problem

Research questions and friction points this paper is trying to address.

active learning
human annotation
LLM annotation
hostility detection
annotation strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

active learning
LLM annotation
hostility detection
annotation bias
cost-effectiveness