Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies systematic bias risks arising from using large language models (LLMs) for text annotation in social science research. The authors define and empirically quantify "LLM hacking": unintended statistical inference errors introduced by researcher choices in model selection, prompt engineering, temperature settings, and other implementation decisions, leading to Type I, II, S, or M errors. Through a multi-model comparative experiment spanning 37 annotation tasks, 13 million annotations, 18 LLMs, and 2,361 hypothesis tests, benchmarked against human-annotated baselines, they find: (i) roughly one in three hypotheses yields an erroneous conclusion even with state-of-the-art models, rising to about half for small language models; (ii) larger effect sizes mitigate but do not eliminate the risk; (iii) common regression-based correction techniques offer limited protection, heavily trading Type I against Type II errors; and (iv) human annotation substantially improves result robustness. The work provides a critical methodological warning and an empirical evaluation framework for LLM-augmented social science research.
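The Type I/II/S/M taxonomy can be made concrete. The sketch below is not the authors' code; the 0.05 threshold and the 2x magnitude criterion are illustrative assumptions. It classifies a single hypothesis test by comparing the human-annotation baseline against the LLM-annotation result:

```python
def classify_llm_hacking_error(true_effect, true_p, est_effect, est_p,
                               alpha=0.05, magnitude_ratio=2.0):
    """Classify one hypothesis test into the Type I/II/S/M taxonomy.

    true_effect, true_p: estimate and p-value from human-annotated data.
    est_effect, est_p:   the same quantities from LLM-annotated data.
    alpha and magnitude_ratio are illustrative, not from the paper.
    """
    true_sig = true_p < alpha
    est_sig = est_p < alpha

    if not true_sig and est_sig:
        return "type_I"   # false positive: significant only with LLM labels
    if true_sig and not est_sig:
        return "type_II"  # false negative: a real finding is missed
    if true_sig and est_sig:
        if (true_effect > 0) != (est_effect > 0):
            return "type_S"   # sign error: significant, but wrong direction
        ratio = abs(est_effect) / max(abs(true_effect), 1e-12)
        if ratio > magnitude_ratio or ratio < 1.0 / magnitude_ratio:
            return "type_M"   # magnitude error: significant, badly mis-sized
    return "correct"
```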

📝 Abstract
Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection, prompting strategy, or temperature settings). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors. We call this LLM hacking. We quantify the risk of LLM hacking by replicating 37 data annotation tasks from 21 published social science research studies with 18 different models. Analyzing 13 million LLM labels, we test 2,361 realistic hypotheses to measure how plausible researcher choices affect statistical conclusions. We find incorrect conclusions based on LLM-annotated data in approximately one in three hypotheses for state-of-the-art models, and in half the hypotheses for small language models. While our findings show that higher task performance and better general model capabilities reduce LLM hacking risk, even highly accurate models do not completely eliminate it. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of findings near significance thresholds. Our extensive analysis of LLM hacking mitigation techniques emphasizes the importance of human annotations in reducing false positive findings and improving model selection. Surprisingly, common regression estimator correction techniques are largely ineffective in reducing LLM hacking risk, as they heavily trade off Type I vs. Type II errors. Beyond accidental errors, we find that intentional LLM hacking is unacceptably simple. With just a few LLMs and a handful of prompt paraphrases, anything can be presented as statistically significant.
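How do annotation errors propagate into a downstream test? A minimal simulation sketch, under simplified assumptions that are not the paper's pipeline (binary labels, a symmetric label-noise model, a pooled two-proportion z-test, illustrative parameter values), shows one such mechanism: when LLM accuracy differs between the groups being compared, a true null hypothesis starts producing Type I errors:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

def noisy_labels(labels, accuracy):
    """Flip each binary label with probability 1 - accuracy."""
    flips = rng.random(labels.size) > accuracy
    return np.where(flips, 1 - labels, labels)

def false_positive_rate(n=2000, base_rate=0.30, acc_a=0.90, acc_b=0.82,
                        trials=2000, alpha=0.05):
    """Under a true null (identical base rates in both groups), estimate how
    often group-dependent annotation error makes a pooled two-proportion
    z-test spuriously significant."""
    hits = 0
    for _ in range(trials):
        a = (rng.random(n) < base_rate).astype(int)
        b = (rng.random(n) < base_rate).astype(int)
        a_hat, b_hat = noisy_labels(a, acc_a), noisy_labels(b, acc_b)
        pool = (a_hat.sum() + b_hat.sum()) / (2 * n)
        se = sqrt(pool * (1 - pool) * (2 / n))
        if se > 0 and two_sided_p((a_hat.mean() - b_hat.mean()) / se) < alpha:
            hits += 1
    return hits / trials

# Equal accuracy in both groups leaves the test roughly calibrated (~alpha);
# a modest accuracy gap between groups inflates false positives well past it.
print(false_positive_rate(acc_a=0.90, acc_b=0.90))
print(false_positive_rate(acc_a=0.90, acc_b=0.82))
```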
Problem

Research questions and friction points the paper addresses.

Quantifying hidden risks of LLM-based text annotation
Measuring systematic biases from researcher implementation choices
Assessing statistical error propagation in downstream analyses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replicates 37 annotation tasks from 21 published studies with 18 models
Tests 2,361 hypotheses to measure how researcher choices shift conclusions (see the sketch below)
Analyzes 13 million LLM labels for error quantification
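The multiverse design in these bullets, and the abstract's warning about intentional LLM hacking, can be pictured as a grid search over researcher choices. Everything below is hypothetical: annotate_and_test is a simulated placeholder rather than the authors' pipeline, and the model and prompt names are invented. It shows how cherry-picking one significant cell from such a grid manufactures a finding:

```python
import numpy as np
from math import erf, sqrt
from itertools import product

rng = np.random.default_rng(1)

# Hypothetical configuration space; the real study sweeps far more choices.
MODELS = ["model-a", "model-b", "model-c"]
PROMPTS = [f"paraphrase-{i}" for i in range(1, 6)]
TEMPERATURES = [0.0, 0.7]

def annotate_and_test(model, prompt, temperature, true_effect=0.0, se=0.5):
    """Simulated stand-in for: annotate the corpus under one configuration,
    fit the downstream model, and return (effect_estimate, p_value).
    The arguments are labels only here; each call adds a configuration-
    specific systematic perturbation to the estimate."""
    config_bias = rng.normal(0.0, 0.3)
    estimate = true_effect + config_bias + rng.normal(0.0, se)
    z = estimate / se
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return estimate, p

def multiverse(alpha=0.05):
    """Sweep the full grid and report the share of 'significant' cells,
    here under a simulated true effect of zero."""
    cells = []
    for model, prompt, temp in product(MODELS, PROMPTS, TEMPERATURES):
        effect, p = annotate_and_test(model, prompt, temp)
        cells.append((model, prompt, temp, effect, p < alpha))
    return cells, sum(c[4] for c in cells) / len(cells)

cells, share = multiverse()
print(f"{share:.0%} of configurations reach p < 0.05 despite a true null")
# Reporting any single significant cell instead of the whole grid is
# exactly the intentional LLM hacking the abstract warns about.
```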