Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish

📅 2025-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Benchmarks for low-resource languages such as Turkish frequently suffer from linguistic inaccuracy and cultural misalignment, undermining the validity of NLP evaluation. Method: We propose a six-criterion dataset quality framework for low-resource languages, assessing linguistic correctness, cultural appropriateness, and terminological accuracy, among other criteria, via expert human annotation and multi-model collaborative evaluation (GPT-4o, Llama3.3-70B). Contribution/Results: Systematic evaluation of 17 mainstream Turkish benchmarks reveals that 70% fail to meet baseline quality thresholds and that 85% of the assessed criteria exhibit significant deficiencies. LLM judges substantially underperform humans on cultural commonsense reasoning, yet the two models show complementary strengths: GPT-4o excels at grammatical and terminological judgment, while Llama3.3-70B is stronger on label correctness and cultural knowledge evaluation. This work establishes a reproducible methodology for rigorous quality control of low-resource language benchmarks.

📝 Abstract
The reliance on translated or adapted datasets from English or multilingual resources introduces challenges regarding linguistic and cultural suitability. This study addresses the need for robust and culturally appropriate benchmarks by evaluating the quality of 17 commonly used Turkish benchmark datasets. Using a comprehensive framework that assesses six criteria, both human and LLM-judge annotators provide detailed evaluations to identify dataset strengths and shortcomings. Our results reveal that 70% of the benchmark datasets fail to meet our heuristic quality standards. Correct usage of technical terms emerges as the strongest criterion, but 85% of the criteria are not satisfied across the examined datasets. Although LLM judges demonstrate potential, they are less effective than human annotators, particularly at understanding cultural commonsense knowledge and interpreting fluent, unambiguous text. GPT-4o has stronger labeling capabilities for grammatical and technical tasks, while Llama3.3-70B excels at correctness and cultural knowledge evaluation. Our findings emphasize the urgent need for more rigorous quality control in creating and adapting datasets for low-resource languages.
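The heuristic quality check described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the criterion names and the 0.8 pass threshold are assumptions for the example, and the paper's six criteria may differ.

```python
# Hypothetical sketch of a per-dataset quality check: average annotator
# ratings per criterion, then require every criterion to clear a threshold.
# Criterion names and the threshold are illustrative assumptions.
from statistics import mean

CRITERIA = [
    "grammar", "fluency", "cultural_suitability",
    "technical_terms", "label_correctness", "unambiguity",
]
PASS_THRESHOLD = 0.8  # assumed heuristic cutoff, not from the paper

def dataset_passes(ratings: dict) -> tuple:
    """ratings maps each criterion to a list of annotator scores in [0, 1].
    A dataset passes only if every criterion's mean clears the threshold."""
    scores = {c: mean(ratings[c]) for c in CRITERIA}
    return all(s >= PASS_THRESHOLD for s in scores.values()), scores

example = {c: [0.9, 0.85, 0.8] for c in CRITERIA}
example["cultural_suitability"] = [0.5, 0.6, 0.4]  # one weak criterion
ok, per_criterion = dataset_passes(example)
print(ok)  # False: cultural suitability falls below the cutoff
```

Requiring every criterion to pass (rather than averaging across criteria) mirrors the paper's finding that a single weak dimension, such as cultural suitability, is enough to disqualify a benchmark.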
Problem

Research questions and friction points this paper is trying to address.

Evaluating quality of Turkish benchmark datasets for low-resource languages
Assessing linguistic and cultural suitability of translated datasets
Comparing human and LLM judges in dataset evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates 17 Turkish benchmark datasets comprehensively
Uses human and LLM judges for detailed assessments
Identifies complementary strengths of GPT-4o and Llama3.3-70B as LLM judges
Ayşe Aysu Cengiz
Middle East Technical University, Computer Engineering Department
Ahmet Kaan Sever
Bilkent University, Computer Engineering Department
Elif Ecem Umutlu
Middle East Technical University, Computer Engineering Department
Naime Şeyma Erdem
Turkcell AI
Burak Aytan
Turkcell AI
Büşra Tufan
Hacettepe University, Sociology Department
Abdullah Topraksoy
Istanbul University, Linguistics Department
Esra Darici
Middle East Technical University, Turkish Language Department
Cagri Toraman
Asst. Prof. Middle East Technical University, Department of Computer Engineering
natural language processing · information retrieval · social computing