The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection

📅 2024-11-17
🏛️ arXiv.org
🤖 AI Summary
High-quality labeled data for media bias classification is costly and labor-intensive to obtain manually. Method: the authors propose an LLM-driven synthetic annotation paradigm for this task, using models such as GPT-4 and Claude in zero-shot/few-shot annotation, instruction-tuning, and synthetic data-cleaning pipelines to build Annolexical, the first large-scale synthetic dataset for media bias classification with over 48,000 samples. Contribution/Results: they validate synthetic label quality and behaviorally stress-test the trained classifier, uncovering latent biases and trade-offs in LLM-generated annotations. Experiments show that a classifier trained on the LLM-annotated data outperforms all of the annotating LLMs by 5-9 points in Matthews Correlation Coefficient (MCC), matches or exceeds a human-annotated baseline, and generalizes across the BABE and BASIL benchmarks, while substantially reducing data curation costs.
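The annotation pipeline summarized above, several annotator LLMs whose zero-/few-shot labels are aggregated into one synthetic label per example, can be sketched as follows. The model names, the `annotate` stub, and the majority-vote aggregation are illustrative assumptions for this sketch, not the paper's exact prompts, models, or aggregation rule:

```python
from collections import Counter

def annotate(model_name, sentence):
    """Hypothetical stand-in for querying one annotator LLM.

    In a real pipeline this would send a zero-shot or few-shot prompt
    to an API-served model; here it is a fixed lookup so the sketch
    runs offline.
    """
    mock_votes = {
        ("model-a", "The senator bravely defended the bill."): "biased",
        ("model-b", "The senator bravely defended the bill."): "biased",
        ("model-c", "The senator bravely defended the bill."): "neutral",
    }
    return mock_votes[(model_name, sentence)]

def majority_label(sentence, models):
    """Collect one label per annotator LLM and take the majority vote."""
    votes = [annotate(m, sentence) for m in models]
    return Counter(votes).most_common(1)[0][0]

label = majority_label("The senator bravely defended the bill.",
                       ["model-a", "model-b", "model-c"])
```

The aggregated labels would then serve as training targets for a smaller downstream classifier, which is the step the paper evaluates against human-labeled baselines.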

📝 Abstract
High annotation costs from hiring or crowdsourcing complicate the creation of large, high-quality datasets needed for training reliable text classifiers. Recent research suggests using Large Language Models (LLMs) to automate the annotation process, reducing these costs while maintaining data quality. LLMs have shown promising results in annotating downstream tasks like hate speech detection and political framing. Building on the success in these areas, this study investigates whether LLMs are viable for annotating the complex task of media bias detection and whether a downstream media bias classifier can be trained on such data. We create Annolexical, the first large-scale dataset for media bias classification with over 48,000 synthetically annotated examples. Our classifier, fine-tuned on this dataset, surpasses all of the annotator LLMs by 5-9 percent in Matthews Correlation Coefficient (MCC) and performs close to or outperforms the model trained on human-labeled data when evaluated on two media bias benchmark datasets (BABE and BASIL). This study demonstrates how our approach significantly reduces the cost of dataset creation in the media bias domain and, by extension, the development of classifiers, while our subsequent behavioral stress-testing reveals some of its current limitations and trade-offs.
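The abstract reports gains of 5-9 points in Matthews Correlation Coefficient (MCC). As a reference, here is a minimal implementation from the standard binary confusion-matrix definition; the example label vectors are illustrative, not drawn from the paper's data:

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any confusion-matrix margin is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative labels only: 3 TP, 2 TN, 1 FP, 1 FN -> MCC = 5/12.
score = mcc([1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0])
```

Unlike accuracy, MCC accounts for all four confusion-matrix cells, which makes it a common choice for class-imbalanced tasks such as bias detection.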
Problem

Research questions and friction points this paper is trying to address.

Text Classification
Data Annotation
Media Bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Media Bias Dataset
Cost Reduction
Tomáš Horych
University of Göttingen, Göttingen, Germany
Christoph Mandl
University of Göttingen, Göttingen, Germany
Terry Ruas
University of Göttingen (Prev: Uni. of Michigan, NII Tokyo, Uni. of Wuppertal, UFABC)
Natural Language Processing
Lexical Semantics
Text Generation
Paraphrasing
André Greiner-Petter
National Institute of Informatics, Tokyo, Japan
Bela Gipp
University of Göttingen, Göttingen, Germany
Akiko Aizawa
National Institute of Informatics, Tokyo, Japan
Timo Spinde
NII Tokyo / Media Bias Group
Media Bias
AI
Natural Language Processing