🤖 AI Summary
Existing multilingual reasoning benchmarks predominantly rely on machine- or human-translated versions of English-centric evaluations, introducing linguistic and cultural biases that hinder accurate assessment of models' genuine cross-cultural reasoning capabilities. To address this, we introduce MultiNRC, the first native-speaker-authored multilingual reasoning benchmark covering French, Spanish, and Chinese (more than 1,000 items), explicitly designed to probe language-specific linguistic phenomena, wordplay, cultural traditions, and culturally embedded mathematical reasoning. Items in the cultural/tradition and culturally grounded math categories are accompanied by careful human English translations to enable fair cross-lingual comparison. We construct a four-task taxonomy and systematically evaluate 14 state-of-the-art large language models. Results show that no model scores above 50% on MultiNRC, that models exhibit distinct strengths and weaknesses across linguistic, cultural, and logical reasoning, and that accuracy on culturally grounded math items is roughly 10 percentage points lower in the native languages than on their English translations, revealing a persistent deficit in culturally rooted knowledge.
📝 Abstract
Although recent Large Language Models (LLMs) have shown rapid improvement on English reasoning benchmarks, evaluation of their multilingual reasoning capabilities across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating English reasoning benchmarks, biasing them toward reasoning problems grounded in English-language contexts and cultures. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistically and culturally grounded reasoning questions written by native speakers of French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For the cultural/tradition reasoning and culturally relevant math categories, we also provide English-equivalent translations of the multilingual questions, produced manually by native speakers fluent in English. This English-equivalent set enables a direct comparison of LLM reasoning capacity in other languages versus English on the same questions. We systematically evaluate 14 leading LLMs covering most major LLM families on MultiNRC and its English-equivalent set. The results show that (1) current LLMs are still not good at native multilingual reasoning, with none scoring above 50% on MultiNRC; (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; and (3) most models perform substantially better on math reasoning in English than in the original languages (by roughly 10 percentage points), indicating persistent challenges with culturally grounded knowledge.
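
As a rough illustration of the paired native-vs-English comparison described above, the following Python sketch computes per-category accuracy and the English-minus-native gap on the translated subset. This is not the authors' evaluation code; the field names, category label, and data layout are hypothetical assumptions for illustration only.

```python
# Minimal sketch (assumed data layout, not the paper's release) of the paired
# native-vs-English accuracy comparison on culturally grounded math items.
from collections import defaultdict


def accuracy_by_category(results):
    """results: iterable of dicts with 'category', 'language', 'correct' keys (hypothetical schema)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        key = (r["category"], r["language"])
        correct[key] += int(r["correct"])
        total[key] += 1
    return {key: correct[key] / total[key] for key in total}


def native_vs_english_gap(results, category="cultural_math"):
    """Accuracy gap (English minus native) for one category's paired items."""
    acc = accuracy_by_category(results)
    native = acc.get((category, "native"))
    english = acc.get((category, "english"))
    if native is None or english is None:
        return None
    return english - native


if __name__ == "__main__":
    # Toy graded outputs: each item judged once in its native form and once
    # on its human English translation.
    toy = [
        {"category": "cultural_math", "language": "native", "correct": False},
        {"category": "cultural_math", "language": "english", "correct": True},
        {"category": "cultural_math", "language": "native", "correct": True},
        {"category": "cultural_math", "language": "english", "correct": True},
    ]
    print(accuracy_by_category(toy))
    # Positive gap means the model answers the English translations better,
    # mirroring the ~10-point gap reported in the abstract.
    print(native_vs_english_gap(toy))
```

A positive gap on the paired subset isolates the effect of the question's language and cultural framing, since the underlying reasoning problem is identical in both versions.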