🤖 AI Summary
Existing evaluation frameworks inadequately assess the multilingual long-context reasoning capabilities of large language models (LLMs): mainstream benchmarks rely predominantly on retrieval tasks and suffer from data leakage, shortcut learning, and insufficient emphasis on deep reasoning. To address this, we propose MLRBench, the first synthetic benchmark explicitly designed for multilingual long-context reasoning. It spans seven languages and targets three core competencies: multi-hop reasoning, factual aggregation, and epistemic reasoning. Our contributions include: (1) the first systematic decoupling of information retrieval from deep reasoning; (2) a robust, length-scalable, cross-lingually parallel evaluation framework resistant to data leakage; and (3) the empirical finding that, in multilingual settings, models effectively utilize less than 30% of their claimed context length. Experiments reveal significant performance disparities between high- and low-resource languages, with retrieval-augmented generation (RAG) only partially alleviating this bottleneck. The benchmark is publicly released to advance research in multilingual long-context reasoning.
📝 Abstract
Existing multilingual long-context benchmarks, often based on the popular needle-in-a-haystack test, primarily evaluate a model's ability to locate specific information buried within irrelevant texts. However, such a retrieval-centric approach is myopic and inherently limited, as successful recall alone does not indicate a model's capacity to reason over extended contexts. Moreover, these benchmarks are susceptible to data leakage and short-circuiting, and risk making the evaluation a priori identifiable. To address these limitations, we introduce MLRBench, a new synthetic benchmark for multilingual long-context reasoning. Unlike existing benchmarks, MLRBench goes beyond surface-level retrieval by including tasks that assess multi-hop inference, aggregation, and epistemic reasoning. Spanning seven languages, MLRBench is designed to be parallel, resistant to leakage, and scalable to arbitrary context lengths. Our extensive experiments with open-weight large language models (LLMs) reveal a pronounced gap between high- and low-resource languages, particularly for tasks requiring the model to aggregate multiple facts or predict the absence of information. We also find that, in multilingual settings, LLMs effectively utilize less than 30% of their claimed context length. Although off-the-shelf Retrieval-Augmented Generation (RAG) helps alleviate this to a certain extent, it does not solve the long-context problem. We open-source MLRBench to enable future research in improved evaluation and training of multilingual LLMs.