🤖 AI Summary
This study investigates the systematic impact of LLM-assisted annotation on human judgment and downstream evaluation in subjective tasks. Using a pre-registered experimental design, we deployed two LLMs across two subjectivity datasets under three AI-assistance conditions, collecting over 7,000 annotations from 410 crowdworkers. Results show that LLM suggestions did not improve annotation efficiency but significantly increased annotator confidence and induced systematic shifts in the label distribution, which we term "annotation drift." This drift inflates estimated LLM performance when models are evaluated on assisted annotations, compromising validity in downstream social science research. To our knowledge, this is the first empirical demonstration of annotation drift induced by LLM assistance in subjective tasks. The findings underscore the need to redesign human-AI collaborative annotation protocols and offer methodological warnings for annotation quality control and benchmark construction in subjective NLP evaluation.
📝 Abstract
LLM use in annotation is becoming widespread, and given LLMs' promising overall performance and speed, simply "reviewing" LLM annotations in interpretive tasks can be tempting. In subjective annotation tasks with multiple plausible answers, reviewing LLM outputs can shift the label distribution, affecting both the evaluation of LLM performance and any downstream social science analysis that uses these labels. We conducted a pre-registered experiment with 410 unique annotators and over 7,000 annotations, testing three AI-assistance conditions against controls, using two models and two datasets. We find that presenting crowdworkers with LLM-generated annotation suggestions did not make them faster, but did increase their self-reported confidence in the task. More importantly, annotators frequently adopted the LLM suggestions, significantly shifting the label distribution relative to the baseline. When labels created with LLM assistance are then used to evaluate LLM performance, reported model performance increases significantly. We believe our work underlines the importance of understanding how LLM-assisted annotation affects subjective, qualitative tasks, the creation of gold data for training and testing, and the evaluation of NLP systems on subjective tasks.