Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages: A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish

📅 2025-11-05
🤖 AI Summary
This study addresses the evaluation of large language models (LLMs) on low-resource, morphologically complex languages—Cantonese, Japanese, and Turkish—where existing benchmarks lack cultural adaptability and morphological sensitivity. To bridge this gap, we introduce the first human-evaluated, trilingual, multi-task benchmark covering question answering, summarization, translation, and culturally grounded dialogue. Our evaluation integrates fluency, factual accuracy, and cultural appropriateness, complemented by automated metrics (BLEU, ROUGE). Experiments span seven prominent models, including GPT-4o, Claude 3.5, and LLaMA variants. Results reveal that proprietary models outperform open-weight counterparts overall, yet all struggle significantly with Turkish agglutination and Cantonese colloquialisms; small open-weight models lag substantially in both accuracy and fluency. This work provides the first systematic empirical evidence of morphological-generalization and cultural-understanding bottlenecks in LLMs, establishing a foundational evaluation paradigm for truly linguistically inclusive models.

📝 Abstract
Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven cutting-edge LLMs -- including GPT-4o, GPT-4, Claude 3.5 Sonnet, LLaMA 3.1, Mistral Large 2, LLaMA-2 Chat 13B, and Mistral 7B Instruct -- on a new cross-lingual benchmark covering Cantonese, Japanese, and Turkish. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine human evaluations (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4, Claude 3.5) generally lead across languages and tasks, significant gaps persist in culturally nuanced understanding and morphological generalization. Notably, GPT-4o demonstrates robust multilingual performance even on cross-lingual tasks, and Claude 3.5 Sonnet achieves competitive accuracy on knowledge and reasoning benchmarks. However, all models struggle to some extent with the unique linguistic challenges of each language, such as Turkish agglutinative morphology and Cantonese colloquialisms. Smaller open-source models (LLaMA-2 13B, Mistral 7B) lag substantially in fluency and accuracy, highlighting the resource disparity. We provide detailed quantitative results, qualitative error analysis, and discuss implications for developing more culturally aware and linguistically generalizable LLMs. Our benchmark and evaluation data are released to foster reproducibility and further research.
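The automated side of the evaluation rests on n-gram-overlap metrics such as BLEU. As a rough illustration of what such a score measures, here is a minimal, stdlib-only sketch of smoothed sentence-level BLEU; the function names, whitespace tokenization, and add-one smoothing are assumptions for illustration, not the paper's actual setup (in practice one would use a library such as sacreBLEU, with language-appropriate tokenization for Cantonese and Japanese, where whitespace splitting does not apply).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing on each n-gram precision.

    Assumes whitespace tokenization, which only suits space-delimited
    languages; CJK text needs a segmenter or character-level n-grams.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        hyp_counts = Counter(ngrams(hyp, n))
        # clipped overlap: each hypothesis n-gram counts at most as often
        # as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec_sum += math.log((overlap + 1) / (total + 1))
    # brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec_sum / max_n)
```

A perfect match scores 1.0 and disjoint sentences score near 0, so the metric rewards surface overlap — which is exactly why the paper pairs it with human ratings of fluency and cultural appropriateness that n-gram overlap cannot capture.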
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on low-resource and morphologically rich languages
Assessing performance gaps in cultural understanding and morphology
Benchmarking models across Cantonese, Japanese, and Turkish tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated seven LLMs on low-resource languages
Used human and automated metrics for assessment
Created cross-lingual benchmark for diverse tasks
Chengxuan Xia
University of California, Santa Cruz, CA, USA
Qianye Wu
Carnegie Mellon University, Pittsburgh, PA, USA
Hongbin Guan
Carnegie Mellon University, Pittsburgh, PA, USA
Sixuan Tian
Carnegie Mellon University, Pittsburgh, PA, USA
Yilun Hao
Massachusetts Institute of Technology
Xiaoyu Wu
Central University of Finance and Economics