🤖 AI Summary
Existing evaluation paradigms for large language models (LLMs) in text annotation tasks emphasize output correctness, neglecting whether LLMs can statistically emulate human subjective judgment.
Method: We propose “indistinguishability” as a novel evaluation criterion and introduce the first statistical framework integrating Krippendorff’s α, paired bootstrap resampling, and Two One-Sided Tests (TOST) to quantify behavioral equivalence between LLM and human annotators.
Contribution/Results: Empirical validation on MovieLens-100K demonstrates statistical indistinguishability (TOST p = 0.004), while on PolitiFact equivalence cannot be established (p = 0.155), so the LLM remains distinguishable there; this contrast supports the framework's ability to signal task suitability. This work establishes a verifiable statistical foundation and methodological pipeline for integrating LLMs into human-in-the-loop crowdsourcing annotation workflows.
📝 Abstract
Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated "ground truth" using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we approach LLMs as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the two one-sided tests (TOST) equivalence procedure. This method tests whether an LLM can blend into a group of human annotators without being distinguishable.
We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former ($p = 0.004$) but not in the latter ($p = 0.155$), highlighting task-dependent differences. The framework also enables early evaluation on a small sample of human data to inform whether LLMs are suitable for large-scale annotation in a given application.
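The pipeline described in the abstract can be sketched in code. The sketch below is illustrative, not the paper's implementation: it assumes interval-scale ratings with no missing values, assumes the test statistic is the change in Krippendorff's $\alpha$ when the LLM replaces one human annotator, and uses the confidence-interval form of TOST (equivalence at the 5% level when the 90% bootstrap CI of the difference lies inside symmetric bounds $\pm\varepsilon$). The function names and the bound `eps` are hypothetical choices for this sketch.

```python
import numpy as np

def krippendorff_alpha(data):
    """Krippendorff's alpha for interval-scale ratings.

    data: array of shape (n_annotators, n_items), no missing values.
    Returns 1 - D_o / D_e (observed vs. expected disagreement).
    """
    data = np.asarray(data, dtype=float)
    m, n = data.shape
    # Observed disagreement: mean squared difference between ratings
    # of the same item by different annotators (all ordered pairs).
    d_o = 0.0
    for u in range(n):
        col = data[:, u]
        d_o += np.sum((col[:, None] - col[None, :]) ** 2)
    d_o /= n * m * (m - 1)
    # Expected disagreement: mean squared difference between any two
    # ratings in the pooled data, regardless of item.
    pooled = data.ravel()
    N = pooled.size
    d_e = np.sum((pooled[:, None] - pooled[None, :]) ** 2) / (N * (N - 1))
    return 1.0 - d_o / d_e

def tost_equivalence(human_only, with_llm, eps=0.05, n_boot=2000, seed=0):
    """Paired bootstrap over items + TOST-style equivalence check.

    human_only, with_llm: (n_annotators, n_items) rating matrices for the
    all-human group and the group with one human replaced by the LLM.
    Declares equivalence when the 90% bootstrap CI of the alpha
    difference lies inside [-eps, eps] (the CI form of TOST at 5%).
    """
    rng = np.random.default_rng(seed)
    n = human_only.shape[1]
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample items, paired
        diffs[b] = (krippendorff_alpha(with_llm[:, idx])
                    - krippendorff_alpha(human_only[:, idx]))
    lo, hi = np.percentile(diffs, [5, 95])
    return (lo, hi), bool(-eps <= lo and hi <= eps)
```

In practice one would feed in the real human rating matrix and the LLM's ratings on the same items; a CI inside the bounds mirrors the MovieLens-style "indistinguishable" outcome, while a CI crossing a bound mirrors the PolitiFact-style failure to establish equivalence.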