Are LLMs Ready to Replace Bangla Annotators?

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can reliably substitute for human annotators in the low-resource, identity-sensitive task of Bengali hate speech annotation, with a focus on model bias and instability. Using a unified evaluation framework, the authors systematically assess the zero-shot annotation performance of 17 LLMs, quantifying inter-annotator agreement, bias levels, and the relationship between model scale and performance. Contrary to the prevailing "bigger is better" assumption, the work finds that increased model size does not improve annotation quality in this sensitive context; instead, certain smaller, task-aligned models are more stable. These findings challenge conventional scaling paradigms and underscore the limitations of current LLMs in handling sensitive content in low-resource languages.

📝 Abstract
Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators, especially in low-resource and identity-sensitive settings, remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality: smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
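The abstract's core measurement, agreement between an LLM annotator and a human reference, is commonly quantified with a chance-corrected statistic such as Cohen's kappa. The paper does not specify its agreement metric, so the sketch below is illustrative only; the labels are hypothetical placeholders, not data from the study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the two annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of a match given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical binary hate-speech labels: human reference vs. one LLM annotator.
human = ["hate", "not", "hate", "not", "hate", "not"]
llm   = ["hate", "not", "not",  "not", "hate", "hate"]
print(round(cohens_kappa(human, llm), 3))  # → 0.333
```

Kappa near 0 indicates chance-level agreement and near 1 near-perfect agreement; running this pairwise across all 17 models against the human labels would surface the kind of instability the paper reports.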
Problem

Research questions and friction points this paper is trying to address.

LLMs
Bangla hate speech
annotator bias
low-resource languages
annotation reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Zero-shot Annotation
Annotator Bias
Low-resource Languages
Hate Speech Detection
Md. Najib Hasan
A2I Lab, School of Computing, Wichita State University
Touseef Hasan
A2I Lab, School of Computing, Wichita State University
Souvika Sarkar
Wichita State University
Natural Language Processing · Information Retrieval · Machine Learning · Artificial Intelligence