🤖 AI Summary
Existing Turkish-language benchmarks suffer from task narrowness and insufficient cultural grounding, limiting comprehensive evaluation of LLMs' linguistic understanding, generation, and cultural competence. To address this, we introduce Cetvel, a comprehensive, culturally grounded evaluation benchmark for Turkish comprising 23 tasks grouped into seven categories, including grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. Cetvel integrates both discriminative and generative tasks while incorporating culturally relevant content. We conduct standardized evaluation across 33 open-weight LLMs (up to 70B parameters), covering multiple model families and instruction-tuning paradigms. Results show that multilingual and general-purpose models consistently outperform Turkish-centric instruction-tuned models, and that grammatical error correction and extractive question answering are the most discriminative tasks for differentiating model capabilities. Cetvel establishes a comprehensive and culturally grounded foundation for developing and assessing LLMs in Turkish.
📝 Abstract
We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of discriminative and generative tasks with content that reflects the linguistic and cultural richness of the Turkish language. Cetvel covers 23 tasks grouped into seven categories, including grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g., Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.