🤖 AI Summary
This paper addresses the insufficient evaluation of labeling reliability in social science research that uses large language models (LLMs). We propose a scalable framework for internal consistency assessment inspired by survey methodology. The approach incorporates option/position randomization and reverse-validation interventions, coupled with a fine-grained R-score based on KL divergence that distinguishes random guessing from semantically grounded annotations. Experiments on the F1000 benchmark across Llama 8B, 70B, and 405B models reveal that 5–25% of annotations are sensitive to the interventions, indicating lower reliability, and that larger models are more stable. Notably, approximately 50% of rare-class annotations deemed "correct" by conventional metrics (e.g., accuracy or F1) turn out to have low reliability. The R-score substantially improves detection of ambiguous or borderline cases, enabling more robust, interpretable, and trustworthy LLM-based annotation in social science applications.
📝 Abstract
This paper introduces a framework for assessing the reliability of Large Language Model (LLM) text annotations in social science research by adapting established survey methodology principles. Drawing parallels between survey respondent behavior and LLM outputs, the study implements three key interventions: option randomization, position randomization, and reverse validation. While traditional accuracy metrics may mask model instabilities, particularly in edge cases, our framework provides a more comprehensive reliability assessment. Using the F1000 dataset in biomedical science and three sizes of Llama models (8B, 70B, and 405B parameters), the paper demonstrates that these survey-inspired interventions can effectively identify unreliable annotations that would otherwise go undetected through accuracy metrics alone. The results show that 5–25% of LLM annotations change under these interventions, with larger models exhibiting greater stability. Notably, for rare categories, approximately 50% of "correct" annotations demonstrate low reliability when subjected to this framework. The paper introduces an information-theoretic reliability score (R-score) based on Kullback-Leibler divergence that quantifies annotation confidence and distinguishes between random guessing and meaningful annotations at the case level. This approach complements existing expert validation methods by providing a scalable way to assess internal annotation reliability and offers practical guidance for prompt design and downstream analysis.
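The abstract does not reproduce the exact R-score formula, so the sketch below is only an illustration of how a KL-divergence-based reliability measure of this kind could be computed: the label distribution observed across repeated runs with option/position randomization is compared against a uniform distribution over the answer categories, so that near-zero scores look like random guessing and scores near log(k) indicate annotations that stay stable under the interventions. The function name `r_score`, the category labels, and the way perturbed runs are aggregated are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def r_score(annotations, categories):
    """Hypothetical KL-based reliability score for one item.

    annotations: labels produced by the LLM across repeated runs of the
        same item under option/position randomization.
    categories: the full set of allowed answer categories.

    Returns KL(P || U), where P is the empirical label distribution and
    U is uniform over the categories. 0 = indistinguishable from random
    guessing; log(len(categories)) = perfectly stable annotations.
    """
    counts = Counter(annotations)
    n = len(annotations)
    uniform = 1.0 / len(categories)
    score = 0.0
    for c in categories:
        p = counts.get(c, 0) / n
        if p > 0:  # terms with p = 0 contribute nothing to the KL sum
            score += p * math.log(p / uniform)
    return score

# Example: labels from five perturbed runs of a single item
labels = ["method", "method", "method", "result", "method"]
print(r_score(labels, ["method", "result", "background", "other"]))
```

Under this reading, a low score flags an annotation as unreliable even when the majority label happens to match the gold standard, which is consistent with the paper's finding that many "correct" rare-class annotations are low-reliability.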