🤖 AI Summary
This work addresses the lack of systematic evaluation of large language models' abilities to understand, reason over, and generate code-mixed (mixed-language) text. To this end, we introduce CodeMixQA, a benchmark of high-quality, human-annotated parallel corpora covering 16 code-switched language-pair variants that span multiple geographic regions and code-mixing patterns, in both original scripts and transliterated forms. Through question-answering tasks, the benchmark comprehensively assesses models' comprehension of mixed-language inputs, their cross-lingual reasoning consistency, and the fluency and semantic fidelity of their generated outputs. Our study is the first to systematically uncover critical limitations of current large language models in code-mixing scenarios, establishing an empirical foundation and a standardized evaluation framework for developing more robust multilingual models.
📝 Abstract
Code-switching is a pervasive phenomenon in multilingual communication, yet the robustness of large language models (LLMs) in mixed-language settings remains insufficiently understood. In this work, we present a comprehensive evaluation of LLM capabilities in understanding, reasoning over, and generating code-switched text. We introduce CodeMixQA, a novel benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants that span multiple geographic regions and code-switching patterns and include both original scripts and their transliterated forms. Using this benchmark, we analyze the reasoning behavior of LLMs on code-switched question-answering tasks, shedding light on how models process and reason over mixed-language inputs. We further conduct a systematic evaluation of LLM-generated synthetic code-switched text, focusing on both naturalness and semantic fidelity, and uncover key limitations in current generation capabilities. Our findings reveal persistent challenges in both reasoning and generation under code-switching conditions and provide actionable insights for building more robust multilingual LLMs. We release the dataset and code as open source.