Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data

πŸ“… 2025-05-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study systematically evaluates the zero-shot and few-shot binary classification performance of GPT-3.5, GPT-4, LLaMA-3, Mistral-7B, and Claude-2 on human rights violation content in Russian and Ukrainian social media texts, using dual human annotation (ΞΊ = 0.91) as the gold standard. Method: We introduce a novel cross-lingual (English-to-Russian) prompting framework and a fine-grained error analysis pipeline to rigorously assess model reliability on sensitive, subjective, and low-resource human rights annotation tasks. Contribution/Results: GPT-4 achieves the highest performance (F1 = 0.82), yet all models fall significantly short of human inter-annotator agreement in ambiguous contexts. Russian-language prompts improve non-English models’ accuracy by an average of 11%, underscoring the critical role of linguistic adaptation for robustness. This work establishes the first reproducible benchmark and practical guidelines for AI-assisted annotation in high-stakes, multilingual human rights monitoring.
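The evaluation described above compares model labels against double-annotated gold labels using inter-annotator agreement (Cohen's ΞΊ) and F1 on the positive class. A minimal sketch of those two metrics in pure Python (not the paper's code; the labels below are invented toy data):

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary annotations of the same items."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    p_a1 = sum(labels_a) / n  # rate of positive labels, annotator A
    p_b1 = sum(labels_b) / n  # rate of positive labels, annotator B
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)  # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

def f1_positive(gold, pred):
    """F1 for the positive class (1 = human rights violation mentioned)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example (invented labels, not the study's data):
gold = [1, 1, 0, 0, 1, 0, 0, 1]
model = [1, 0, 0, 0, 1, 1, 0, 1]
print(round(cohen_kappa(gold, model), 2))  # 0.5
print(round(f1_positive(gold, model), 2))  # 0.75
```

The same quantities are available in scikit-learn as `cohen_kappa_score` and `f1_score`; the hand-rolled versions are shown only to make the definitions explicit.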

πŸ“ Abstract
In the era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have demonstrated remarkable potential for diverse applications, including tasks requiring nuanced textual understanding and contextual reasoning. This study investigates the capabilities of multiple state-of-the-art LLMs (GPT-3.5, GPT-4, LLaMA-3, Mistral-7B, and Claude-2) for zero-shot and few-shot annotation of a complex textual dataset comprising social media posts in Russian and Ukrainian. Specifically, the focus is on the binary classification task of identifying references to human rights violations within the dataset. To evaluate the effectiveness of these models, their annotations are compared against a gold standard of human double-annotated labels across 1000 samples. The analysis assesses annotation performance under different prompting conditions, with prompts provided in both English and Russian. Additionally, the study explores the distinctive patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability. By juxtaposing LLM outputs with human annotations, this research contributes to understanding the reliability and applicability of LLMs for sensitive, domain-specific tasks in multilingual contexts. It also sheds light on how language models handle inherently subjective and context-dependent judgments, a critical consideration for their deployment in real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' zero-shot and few-shot annotation skills for human rights violations in social media posts
Comparing LLM performance against human annotations in multilingual text classification
Analyzing error patterns and cross-linguistic adaptability of LLMs in sensitive domain tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates multiple LLMs for zero-shot and few-shot annotation
Compares model annotations with human double-annotated labels
Assesses cross-linguistic performance using English and Russian prompts
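The zero-shot and few-shot prompting conditions above, in English and Russian, can be sketched as plain string templates. The instruction wording here is an assumption for illustration; the paper's actual prompts are not reproduced:

```python
# Hypothetical instruction wording (not taken from the paper).
INSTRUCTION_EN = (
    "Does the following social media post refer to a human rights violation? "
    "Answer only 'yes' or 'no'."
)
# Russian-language variant of the same instruction; the study reports that
# Russian prompts improved non-English models' accuracy.
INSTRUCTION_RU = (
    "Упоминается ли в следующем посте нарушение прав человека? "
    "Ответьте только 'да' или 'нет'."
)

def zero_shot_prompt(post, instruction=INSTRUCTION_EN):
    """Instruction plus the target post, with no labeled examples."""
    return f"{instruction}\n\nPost: {post}\nAnswer:"

def few_shot_prompt(post, examples, instruction=INSTRUCTION_EN):
    """examples: list of (post_text, 'yes'/'no') pairs shown before the target."""
    shots = "\n\n".join(
        f"Post: {text}\nAnswer: {label}" for text, label in examples
    )
    return f"{instruction}\n\n{shots}\n\nPost: {post}\nAnswer:"

demo = few_shot_prompt(
    "New post to classify",
    [("Example post A", "yes"), ("Example post B", "no")],
)
```

Swapping `INSTRUCTION_EN` for `INSTRUCTION_RU` yields the cross-lingual condition the study compares; the model under test and its API call are left out of this sketch.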