Are LLMs Ready to Replace Bangla Annotators?

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can reliably substitute for human annotators in the low-resource, identity-sensitive task of Bengali hate speech annotation, with a focus on model bias and instability. Using a unified evaluation framework, the authors systematically assess the zero-shot annotation performance of 17 LLMs, quantifying inter-annotator agreement, bias levels, and the relationship between model scale and performance. Contrary to the prevailing "bigger is better" assumption, the work finds that increased model size does not improve annotation quality in this sensitive context; instead, certain smaller, task-aligned models are more stable. These findings challenge conventional scaling paradigms and underscore the limitations of current LLMs in handling sensitive content in low-resource languages.

📝 Abstract
Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators, especially in low-resource and identity-sensitive settings, remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality: smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
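The abstract's core measurement, agreement between an LLM annotator and a human reference, is commonly quantified with a chance-corrected statistic such as Cohen's kappa. The paper does not specify its agreement metric, so the sketch below is illustrative only; the labels are hypothetical placeholders, not data from the study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the two annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of a match given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical binary hate-speech labels: human reference vs. one LLM annotator.
human = ["hate", "not", "hate", "not", "hate", "not"]
llm   = ["hate", "not", "not",  "not", "hate", "hate"]
print(round(cohens_kappa(human, llm), 3))  # → 0.333
```

Kappa near 0 indicates chance-level agreement and near 1 near-perfect agreement; running this pairwise across all 17 models against the human labels would surface the kind of instability the paper reports.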
Problem

Research questions and friction points this paper is trying to address.

LLMs
Bangla hate speech
annotator bias
low-resource languages
annotation reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Zero-shot Annotation
Annotator Bias
Low-resource Languages
Hate Speech Detection
Md. Najib Hasan
A2I Lab, School of Computing, Wichita State University
Touseef Hasan
A2I Lab, School of Computing, Wichita State University
Souvika Sarkar
Wichita State University
Natural Language Processing · Information Retrieval · Machine Learning · Artificial Intelligence