Disentangling Language Roles in Multilingual LLM Task Execution

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluation frameworks struggle to disentangle the individual contributions of instruction language, content language, and response language in multilingual large language model tasks. This work proposes MTM-Bench, a benchmark with a fully crossed design spanning all 27 language-role combinations, so that each role's effect on performance can be assessed separately. By decoupling multilingual tasks along three independent axes, the study finds that performance degradation is driven primarily by the response language rather than by the number of language mismatches. The authors introduce fine-grained metrics (semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success rate) and validate the scoring through a targeted human audit. Experiments across 20 frontier and open-weight models show that a single response-language mismatch accounts for most of the degradation and that failure modes vary by task family, so semantic correctness alone is an insufficient evaluation metric.
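As a rough illustration of the fully crossed design, the sketch below enumerates every (instruction, content, response) triplet over the three languages named in the abstract (English, Spanish, Chinese) and groups them by mismatch pattern. The language codes, the `mismatch_pattern` helper, and the exact meaning given to labels such as "response-only" and "full-mismatch" are assumptions made for this example, not definitions taken from the paper.

```python
from collections import Counter
from itertools import product

# Assumed language codes for English, Spanish, and Chinese.
LANGS = ("en", "es", "zh")


def mismatch_pattern(triplet):
    """Label an (instr, content, resp) triplet by which role, if any, is the odd one out.

    Hypothetical labelling: "full-mismatch" here means all three roles use
    different languages; the paper may define its conditions differently.
    """
    instr, content, resp = triplet
    distinct = len(set(triplet))
    if distinct == 1:
        return "all-match"
    if distinct == 3:
        return "full-mismatch"
    # Exactly two distinct languages, so exactly one role differs from the other two.
    if instr == content:
        return "response-only"
    if instr == resp:
        return "content-only"
    return "instruction-only"


# Fully crossed design: every language in every role, 3 ** 3 = 27 triplets.
triplets = list(product(LANGS, repeat=3))
counts = Counter(mismatch_pattern(t) for t in triplets)

for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

Under these assumed labels, the 27 triplets split into 3 all-match, 6 full-mismatch, and 6 each for the three single-role mismatch patterns, which is what allows a role's effect to be compared while holding the mismatch count fixed.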

📝 Abstract

Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet \((L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})\). Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
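To make concrete why semantic correctness alone can overstate reliability, here is a minimal sketch of how decomposed per-instance scores might be combined into a joint success rate. The abstract names the metrics but does not give their formulas, so everything here is an assumption: the `InstanceScore` fields, the reading of contamination ratio as the share of the output not in the target language, the `max_contamination` threshold, and the rule that joint success requires every component to pass. The four scores are made-up toy values, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceScore:
    """Hypothetical per-instance record of the decomposed metrics."""
    semantic_correct: bool      # answer content is right
    target_language: bool       # response is in the required language
    constraints_met: bool       # task-specific constraints are satisfied
    contamination_ratio: float  # assumed: share of output in a non-target language, 0.0 to 1.0


def joint_success(score, max_contamination=0.0):
    """Assumed rule: an instance succeeds only if every component passes."""
    return (
        score.semantic_correct
        and score.target_language
        and score.constraints_met
        and score.contamination_ratio <= max_contamination
    )


def rate(scores, predicate):
    """Fraction of scores for which predicate holds (0.0 for an empty list)."""
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(1 for s in scores if predicate(s)) / len(scores)


# Toy data for illustration only.
toy_scores = [
    InstanceScore(True, True, True, 0.0),   # passes everything
    InstanceScore(True, False, True, 0.4),  # right answer, wrong language
    InstanceScore(True, True, False, 0.0),  # right answer, constraint violated
    InstanceScore(False, True, True, 0.0),  # right language, wrong answer
]

semantic_rate = rate(toy_scores, lambda s: s.semantic_correct)
joint_rate = rate(toy_scores, joint_success)
print(f"semantic correctness: {semantic_rate:.2f}, joint success: {joint_rate:.2f}")
```

In this toy set, three of four responses are semantically correct but only one passes every check, which mirrors the kind of gap between semantic correctness and joint success that the abstract describes.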

Problem

Research questions and friction points this paper is trying to address.

multilingual LLMs

language roles

instruction-following

task execution

language mismatch

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual LLMs

language roles disentanglement

fully crossed benchmark

response-language dominance

decomposed evaluation metrics

🔎 Similar Papers

Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models

2024-03-15 · arXiv.org · Citations: 8

Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs

2024-06-13 · arXiv.org · Citations: 23