🤖 AI Summary
This study addresses the evaluation of large language models (LLMs) on low-resource, morphologically complex languages (Cantonese, Japanese, and Turkish), where existing benchmarks lack cultural adaptability and morphological sensitivity. To bridge this gap, we introduce the first human-evaluated, trilingual, multi-task benchmark covering question answering, summarization, translation, and culturally grounded dialogue. Our evaluation integrates fluency, factual accuracy, and cultural appropriateness, complemented by automated metrics (BLEU, ROUGE). Experiments span seven prominent models, including GPT-4o, Claude 3.5 Sonnet, and LLaMA variants. Results show that proprietary models outperform open-weight counterparts overall, yet all struggle significantly with Turkish agglutination and Cantonese colloquialisms; small open-weight models lag substantially in both accuracy and fluency. This work provides the first systematic empirical evidence of morphological-generalization and cultural-understanding bottlenecks in LLMs, establishing a foundational evaluation paradigm for linguistically inclusive models.
📝 Abstract
Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven cutting-edge LLMs -- including GPT-4o, GPT-4, Claude 3.5 Sonnet, LLaMA 3.1, Mistral Large 2, LLaMA-2 Chat 13B, and Mistral 7B Instruct -- on a new cross-lingual benchmark covering **Cantonese, Japanese, and Turkish**. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine **human evaluations** (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4, Claude 3.5) generally lead across languages and tasks, significant gaps persist in culturally nuanced understanding and morphological generalization. Notably, GPT-4o demonstrates robust multilingual performance even on cross-lingual tasks, and Claude 3.5 Sonnet achieves competitive accuracy on knowledge and reasoning benchmarks. However, all models struggle to some extent with the unique linguistic challenges of each language, such as Turkish agglutinative morphology and Cantonese colloquialisms. Smaller open-source models (LLaMA-2 13B, Mistral 7B) lag substantially in fluency and accuracy, highlighting the resource disparity. We provide detailed quantitative results, qualitative error analysis, and a discussion of implications for developing more culturally aware and linguistically generalizable LLMs. Our benchmark and evaluation data are released to foster reproducibility and further research.
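The abstract pairs human judgments with automated metrics such as BLEU. As a rough illustration of what that automated side computes, here is a minimal, self-contained sentence-level BLEU sketch (geometric mean of clipped n-gram precisions with a brevity penalty, no smoothing); the paper's actual evaluation pipeline is not specified here, and production work would typically use a maintained implementation such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference.

    Geometric mean of clipped n-gram precisions (n = 1..max_n),
    multiplied by a brevity penalty; returns 0.0 if any n-gram
    order has zero overlap (no smoothing applied).
    """
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_grams, ref_grams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_grams & ref_grams).values())  # clipped counts
        total = max(sum(hyp_grams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_avg)
```

Whitespace tokenization is itself an assumption: for Cantonese and Japanese, which are not whitespace-delimited, a real evaluation would segment first or score at the character level.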