🤖 AI Summary
This study examines how annotation consistency changes over time in prolonged, small-scale labeling efforts and finds a marked decline. The authors build a sentiment dataset of 3,565 Setswana-language tweets with per-annotation timestamps and use Randolph’s free-marginal Kappa to show, for the first time, that temporal proximity is the dominant predictor of inter-annotator agreement: κ reaches 0.98 for tweets labeled within one minute of each other but only 0.65 for those labeled more than a day apart. The analysis also finds that label confusion concentrates on the negative–neutral boundary and that two annotators show run-length drift consistent with “autopilot” labeling. Benchmarks of three open multilingual encoders and proprietary models (GPT-5 and Gemini) on this three-class sentiment task show that fine-tuning yields gains of 29–43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). The dataset, per-annotation timestamps, and analysis code are publicly released to support quality auditing for African language NLP resources.
📝 Abstract
Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Although the aggregate Randolph's free-marginal Kappa is $\kappa = 0.76$ (rated "excellent"), per-batch $\kappa$ falls by more than 32 points over the course of the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift consistent with autopilot labeling, and (iii) the dominant predictor of $\kappa$ is temporal simultaneity: tweets labeled within one minute of each other achieve $\kappa = 0.98$, while those labeled more than a day apart reach only $\kappa = 0.65$. Annotation speed and tweet-level linguistic features show no meaningful association with $\kappa$. We benchmark three open multilingual encoders and proprietary models (GPT-5 and Gemini) on three-class sentiment classification; fine-tuning yields gains of 29 to 43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). We release the dataset, per-annotation timestamps, and analysis code to support reproducible quality auditing for future African language NLP resources.
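To make the agreement statistic concrete, the following is a minimal sketch of Randolph's free-marginal multirater kappa, $\kappa = (P_o - 1/k) / (1 - 1/k)$, where $P_o$ is the mean proportion of agreeing rater pairs per item and $k$ is the number of categories. It also groups items by the time spread between annotations. This is not the released analysis code: the function names, the record layout, and the middle bucket boundary are assumptions made for illustration. Only the "within one minute" and "more than a day apart" conditions come from the abstract.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def randolph_kappa(items, num_categories):
    """Randolph's free-marginal multirater kappa.

    items: list of label lists, one list per tweet, all with the same
           number of raters (at least two).
    num_categories: number of possible labels (3 for this sentiment task).
    """
    if not items:
        raise ValueError("need at least one item")
    n_raters = len(items[0])
    total = 0.0
    for labels in items:
        counts = Counter(labels)
        # Ordered rater pairs that agree, out of all ordered rater pairs.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        total += agreeing_pairs / (n_raters * (n_raters - 1))
    p_observed = total / len(items)
    # Free-marginal assumption: chance agreement is uniform over categories.
    p_expected = 1.0 / num_categories
    return (p_observed - p_expected) / (1.0 - p_expected)


def time_gap_bucket(timestamps):
    """Bucket an item by the spread between its earliest and latest annotation.

    The bucket names and the one-day middle boundary are hypothetical.
    """
    spread = max(timestamps) - min(timestamps)
    if spread <= timedelta(minutes=1):
        return "within_1_min"
    if spread <= timedelta(days=1):
        return "within_1_day"
    return "over_1_day"


def kappa_by_bucket(records, num_categories):
    """records: list of (labels, timestamps) pairs, one per tweet."""
    grouped = defaultdict(list)
    for labels, timestamps in records:
        grouped[time_gap_bucket(timestamps)].append(labels)
    return {b: randolph_kappa(v, num_categories) for b, v in grouped.items()}


# Toy data, not taken from the dataset.
demo = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neu"],
    ["neu", "neu", "neu"],
    ["pos", "neg", "neu"],
]
print(round(randolph_kappa(demo, 3), 3))  # 0.375
```

Because chance agreement is fixed at $1/k$ rather than estimated from label frequencies, the statistic is not pulled down by skewed class distributions, which is one common reason for preferring it when the number of items per category is not fixed in advance.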