Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether human involvement remains necessary in active learning now that large language models (LLMs) can inexpensively annotate entire datasets. Using a newly compiled dataset of 277,902 German-language political TikTok comments, the authors systematically evaluate seven annotation strategies combined with four text encoders on the task of detecting anti-immigrant hostility. They compare human and LLM annotations in a large-scale real-world setting, finding that a classifier trained on 25,974 samples labeled by GPT-5.2 (at a cost of \$43) achieves F1-Macro comparable to one trained on 3,800 human-annotated samples (costing \$316). However, LLM-trained classifiers systematically over-predict the positive class relative to the human gold standard, particularly on topically ambiguous instances, and their error patterns differ markedly from those of classifiers trained on human labels, so reliance on F1 alone may obscure critical biases.
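The cost comparison above can be made concrete with a little arithmetic. The sketch below derives per-label costs from the four figures quoted in the summary. The per-label values and the ratio are not reported by the source; they are computed here under the assumption that cost scales linearly with the number of labels.

```python
# Figures quoted in the summary; everything derived from them is illustrative.
llm_labels, llm_cost_usd = 25_974, 43.0
human_labels, human_cost_usd = 3_800, 316.0


def cost_per_label(total_cost_usd: float, n_labels: int) -> float:
    """Average cost of one label, assuming cost scales linearly with volume."""
    return total_cost_usd / n_labels


llm_unit = cost_per_label(llm_cost_usd, llm_labels)
human_unit = cost_per_label(human_cost_usd, human_labels)
ratio = human_unit / llm_unit  # how many LLM labels one human label "buys"

print(f"LLM:   ${llm_unit:.4f} per label")
print(f"Human: ${human_unit:.4f} per label")
print(f"One human label costs roughly as much as {ratio:.0f} LLM labels")
```

Under that assumption a human label costs about fifty times as much as an LLM label, which is why the comparison hinges on label quality rather than price.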

📝 Abstract

Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels (\$43) achieves comparable F1-Macro to one trained on 3,800 human annotations (\$316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.
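To illustrate how comparable F1-Macro can coexist with different error structure, here is a minimal sketch in plain Python. The confusion-matrix counts are synthetic and chosen only for illustration; they are not taken from the paper, and the names `clf_over` and `clf_under` are hypothetical. One toy classifier over-predicts the positive class and the other under-predicts it, yet their macro-averaged F1 scores are nearly identical.

```python
def build(tp, fn, fp, tn):
    """Expand confusion-matrix counts into (y_true, y_pred) label lists."""
    y_true = [1] * (tp + fn) + [0] * (fp + tn)
    y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn
    return y_true, y_pred


def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (binary case)."""
    return (f1_for_class(y_true, y_pred, 0) + f1_for_class(y_true, y_pred, 1)) / 2


# Synthetic test set: 30 hostile (1) and 70 non-hostile (0) comments.
clf_over = build(tp=27, fn=3, fp=11, tn=59)   # over-predicts the positive class
clf_under = build(tp=20, fn=10, fp=2, tn=68)  # under-predicts the positive class

macro_over = macro_f1(*clf_over)
macro_under = macro_f1(*clf_under)
predicted_pos_over = sum(clf_over[1])
predicted_pos_under = sum(clf_under[1])

print(f"over-predictor:  macro-F1={macro_over:.3f}, predicted positives={predicted_pos_over}")
print(f"under-predictor: macro-F1={macro_under:.3f}, predicted positives={predicted_pos_under}")
```

Reporting the number of predicted positives, or per-class precision and recall, next to the aggregate score is one simple way to surface this kind of divergence.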

Problem

Research questions and friction points this paper is trying to address.

active learning

human annotation

LLM annotation

hostility detection

annotation strategy

Innovation

Methods, ideas, or system contributions that make the work stand out.

active learning

LLM annotation

hostility detection