One ruler to measure them all: Benchmarking multilingual long-context language models

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates performance disparities between low- and high-resource languages in multilingual large language models (LLMs) with long-context capabilities (8K–128K tokens). To this end, the authors introduce ONERULER, the first multilingual long-context benchmark, covering 26 languages with seven synthetic tasks that evaluate retrieval, aggregation, and robustness (e.g., "needle-absent" variants). The benchmark is built with a two-stage localization pipeline: English instructions are written first, then translated into 25 additional languages in collaboration with native speakers. The experiments yield three key findings: (1) increasing context length widens cross-lingual performance gaps, with Polish achieving the highest score while English ranks only sixth; (2) mismatches between the instruction language and the context language cause performance to fluctuate by up to 20%; and (3) many models, particularly OpenAI's o3-mini-high, incorrectly predict that no answer exists even when one is present, including in high-resource languages. The benchmark is publicly released, providing a standardized evaluation framework and actionable insights for advancing multilingual long-context modeling.
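The "needle-absent" variant described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code: the function names, the magic-number needle format, and the substring-match scoring are all assumptions made for the sketch.

```python
import random

def build_niah_example(haystack_sentences, needle, include_needle, none_token="none"):
    """Assemble a needle-in-a-haystack prompt. If include_needle is False,
    the needle is omitted and the expected answer is the 'none' token."""
    context = list(haystack_sentences)
    if include_needle:
        # Insert the needle at a random position in the haystack.
        context.insert(random.randrange(len(context) + 1), needle)
    prompt = (
        " ".join(context)
        + "\nWhat is the magic number mentioned above? "
        + f"If no magic number appears, answer '{none_token}'."
    )
    # Gold answer: the number at the end of the needle, or the 'none' token.
    answer = needle.split()[-1].rstrip(".") if include_needle else none_token
    return prompt, answer

def score(prediction, answer):
    """1 if the gold answer appears in the model's prediction, else 0."""
    return int(answer.lower() in prediction.lower())
```

Under this setup, finding (3) corresponds to models emitting the `none` token on prompts where `include_needle` was `True`.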

📝 Abstract
We present ONERULER, a multilingual benchmark designed to evaluate long-context language models across 26 languages. ONERULER adapts the English-only RULER benchmark (Hsieh et al., 2024) by including seven synthetic tasks that test both retrieval and aggregation, including new variations of the "needle-in-a-haystack" task that allow for the possibility of a nonexistent needle. We create ONERULER through a two-step process, first writing English instructions for each task and then collaborating with native speakers to translate them into 25 additional languages. Experiments with both open-weight and closed LLMs reveal a widening performance gap between low- and high-resource languages as context length increases from 8K to 128K tokens. Surprisingly, English is not the top-performing language on long-context tasks (ranked 6th out of 26), with Polish emerging as the top language. Our experiments also show that many LLMs (particularly OpenAI's o3-mini-high) incorrectly predict the absence of an answer, even in high-resource languages. Finally, in cross-lingual scenarios where instructions and context appear in different languages, performance can fluctuate by up to 20% depending on the instruction language. We hope the release of ONERULER will facilitate future research into improving multilingual and cross-lingual long-context training pipelines.
Problem

Research questions and friction points this paper is trying to address.

How do long-context LLMs perform across 26 languages as context length grows from 8K to 128K tokens?
How reliably do models handle retrieval and aggregation tasks, including cases where the target ("needle") is absent?
Why do performance gaps between low- and high-resource languages widen with longer contexts, and how does a mismatch between instruction and context language affect results?
Innovation

Methods, ideas, or system contributions that make the work stand out.

ONERULER: a multilingual long-context benchmark spanning 26 languages, adapted from the English-only RULER
Seven synthetic tasks testing retrieval and aggregation, including needle-absent variants
Cross-lingual analysis of instruction/context language mismatch, showing fluctuations of up to 20%