🤖 AI Summary
This study addresses the evaluation of large language models (LLMs) on low-resource, morphologically complex languages (Cantonese, Japanese, and Turkish), where existing benchmarks lack cultural adaptability and morphological sensitivity. To bridge this gap, we introduce the first human-evaluated, trilingual, multi-task benchmark covering question answering, summarization, translation, and culturally grounded dialogue. Our evaluation integrates fluency, factual accuracy, and cultural appropriateness, complemented by automated metrics (BLEU, ROUGE). Experiments span seven prominent models, including GPT-4o, Claude 3.5 Sonnet, and LLaMA variants. Results show that proprietary models outperform open-weight counterparts overall, yet all struggle significantly with Turkish agglutination and Cantonese colloquialisms; small open-weight models lag substantially in both accuracy and fluency. This work provides the first systematic empirical evidence of morphological-generalization and cultural-understanding bottlenecks in LLMs, establishing a foundational evaluation paradigm for linguistically inclusive models.
📝 Abstract
Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven cutting-edge LLMs -- including GPT-4o, GPT-4, Claude 3.5 Sonnet, LLaMA 3.1, Mistral Large 2, LLaMA-2 Chat 13B, and Mistral 7B Instruct -- on a new cross-lingual benchmark covering **Cantonese, Japanese, and Turkish**. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine **human evaluations** (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4, Claude 3.5) generally lead across languages and tasks, significant gaps persist in culturally nuanced understanding and morphological generalization. Notably, GPT-4o demonstrates robust multilingual performance even on cross-lingual tasks, and Claude 3.5 Sonnet achieves competitive accuracy on knowledge and reasoning benchmarks. However, all models struggle to some extent with the unique linguistic challenges of each language, such as Turkish agglutinative morphology and Cantonese colloquialisms. Smaller open-source models (LLaMA-2 13B, Mistral 7B) lag substantially in fluency and accuracy, highlighting the resource disparity. We provide detailed quantitative results, qualitative error analysis, and a discussion of implications for developing more culturally aware and linguistically generalizable LLMs. Our benchmark and evaluation data are released to foster reproducibility and further research.
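The abstract pairs human judgments with automated metrics such as BLEU. As a rough illustration of what that automated side computes, here is a minimal, self-contained sentence-level BLEU sketch (geometric mean of clipped n-gram precisions with a brevity penalty, no smoothing); the paper's actual evaluation pipeline is not specified here, and production work would typically use a maintained implementation such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference.

    Geometric mean of clipped n-gram precisions (n = 1..max_n),
    multiplied by a brevity penalty; returns 0.0 if any n-gram
    order has zero overlap (no smoothing applied).
    """
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_grams, ref_grams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_grams & ref_grams).values())  # clipped counts
        total = max(sum(hyp_grams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_avg)
```

Whitespace tokenization is itself an assumption: for Cantonese and Japanese, which are not whitespace-delimited, a real evaluation would segment first or score at the character level.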