🤖 AI Summary
Existing LLM evaluation benchmarks focus predominantly on high- and medium-resource languages and on advanced reasoning tasks, neglecting fundamental lexical competence, the core building block of language understanding, across the vast majority of the world's 3,800+ written languages.
Method: We introduce ChiKhaPo, a large-scale multilingual benchmark centered on lexical comprehension and generation. It comprises eight subtasks of varying difficulty, built from existing lexicons, monolingual data, and bitext, and covers 2,700+ languages on two of its subtasks, surpassing any existing benchmark in language coverage. We further analyze performance along multiple dimensions, including language family, language resourcedness, task, and comprehension versus generation direction.
Contribution/Results: Experiments on six mainstream LLMs reveal severe performance degradation on low-resource languages, with average accuracy below 30%. ChiKhaPo is the first multitask benchmark explicitly designed for foundational lexical evaluation at this scale. It systematically uncovers critical limitations in current multilingual LLMs and provides a rigorous, empirically grounded evaluation framework to guide future research.
📝 Abstract
Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, ample evidence indicates that LLMs lack basic linguistic competence in the vast majority of the world's 3,800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2,700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation direction. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.