Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a significant language bias in large language models (LLMs) on cross-lingual abstract reasoning: performance consistently degrades in non-English languages, and systematic disparities emerge between open- and closed-source models. To evaluate this issue systematically, we introduce GlobalGroup, the first benchmark specifically designed to assess language bias in abstract reasoning. Grounded in a word-association grouping paradigm, it covers five typologically diverse languages (English, Spanish, Chinese, Hindi, Arabic), with each non-English game paired with an English translation for comparison. GlobalGroup incorporates translation alignment and difficulty quantification, enabling controlled, cross-lingually comparable evaluation that does not depend on formulaic strategies or memorized knowledge. Experimental results provide empirical evidence that LLMs exhibit language-modality bias in pattern recognition and divergent thinking tasks, revealing systematic performance disparities across linguistic contexts. The benchmark establishes a methodological and empirical foundation for advancing research on fairness and equity in multilingual reasoning.

📝 Abstract
Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language than in another, even with similar content. Most previous work evaluates this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as commonsense or math tasks. However, abstract reasoning is vital to everyday reasoning, where people apply "out-of-the-box thinking" to identify and use patterns for solutions without relying on formulaic approaches. Comparatively little work has evaluated linguistic biases in this task type. In this paper, we propose GlobalGroup, a task inspired by the New York Times game Connections that evaluates models on abstract reasoning across several languages. We constructed a game benchmark with five linguistic backgrounds -- English, Spanish, Chinese, Hindi, and Arabic -- in both the native language and an English translation for comparison. We also propose game difficulty measurements so that models can be evaluated on games of similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find that English modalities largely lead to better performance on this abstract reasoning task, and we observe performance disparities between open- and closed-source models.
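To make the grouping paradigm concrete, here is a minimal sketch of how a Connections-style game can be scored: 16 words are partitioned into four groups of four, and a predicted group is credited only on an exact match with a hidden gold group. The example words, category labels, and all-or-nothing scoring rule are illustrative assumptions, not GlobalGroup's exact protocol.

```python
# A Connections-style puzzle: 16 words hiding 4 groups of 4.
# Words and categories are illustrative, not taken from GlobalGroup.
gold_groups = {
    "citrus fruits": {"lime", "lemon", "orange", "grapefruit"},
    "shades of green": {"olive", "sage", "mint", "forest"},
    "___ juice": {"apple", "grape", "fruit", "tomato"},
    "things that peel": {"banana", "paint", "sunburn", "sticker"},
}

def score_prediction(predicted_groups, gold_groups):
    """Count how many predicted 4-word groups exactly match a gold group.

    A group is credited only on an exact set match, mirroring the
    all-or-nothing rule of the original Connections game.
    """
    gold_sets = list(gold_groups.values())
    return sum(1 for group in predicted_groups
               if any(set(group) == gold for gold in gold_sets))

# Example: a model solved two of the four groups.
prediction = [
    {"lime", "lemon", "orange", "grapefruit"},  # correct
    {"olive", "sage", "mint", "forest"},        # correct
    {"apple", "grape", "banana", "tomato"},     # decoy trap
    {"fruit", "paint", "sunburn", "sticker"},   # decoy trap
]
print(score_prediction(prediction, gold_groups))  # -> 2
```

Because the same scoring applies to a game in its native language and to its English translation, per-language solve rates are directly comparable, which is what exposes the modality bias the paper reports.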
Problem

Research questions and friction points this paper is trying to address.

Evaluating linguistic biases in abstract reasoning across languages
Assessing LLM performance disparities between language modalities
Measuring abstract reasoning without formulaic approaches or knowledge reliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-lingual word grouping game benchmark
Game difficulty measurements for controlled comparison (see the hypothetical sketch after this list)
Abstract reasoning evaluation across five languages
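The difficulty measurements are only named above, not specified, so the following is a hypothetical sketch of one plausible proxy: a game is scored as harder when words from different gold groups are semantically close (tempting decoys), measured by embedding cosine similarity. The `embed` callable is an assumed stand-in for any word-embedding model; this is not the paper's actual metric.

```python
import numpy as np

def cross_group_confusability(groups, embed):
    """Hypothetical difficulty proxy for a word-grouping game.

    Intuition: a game is harder when words from *different* gold
    groups sit close together in embedding space, creating tempting
    wrong groupings. `embed` maps a word to a unit-norm vector.
    Illustrative only -- not the metric used in the GlobalGroup paper.
    """
    sims = []
    group_list = [list(g) for g in groups]
    for i, g1 in enumerate(group_list):
        for g2 in group_list[i + 1:]:
            sims.extend(float(np.dot(embed(w1), embed(w2)))
                        for w1 in g1 for w2 in g2)
    return float(np.mean(sims))  # higher = more confusable = harder

# Toy usage with seeded random stand-in embeddings:
rng = np.random.default_rng(0)
_cache = {}

def toy_embed(word):
    if word not in _cache:
        v = rng.normal(size=32)
        _cache[word] = v / np.linalg.norm(v)
    return _cache[word]

puzzle = [{"lime", "lemon"}, {"olive", "sage"}, {"apple", "grape"}]
print(round(cross_group_confusability(puzzle, toy_embed), 3))
```

Holding such a score fixed across a native-language game and its English translation is one way to realize the controlled comparison described above.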