🤖 AI Summary
This study addresses the lack of mechanistic understanding and standardized evaluation for cross-lingual and cross-cultural commonsense reasoning in large language models (LLMs). Methodologically, we (1) propose the first fine-grained, skill-oriented taxonomy for multilingual commonsense reasoning; (2) design a data synthesis pipeline integrating template-based generation and knowledge injection, with explicit support for multilingual alignment and cultural adaptation; and (3) establish a dynamic complexity grading framework grounded in model capability feedback. Evaluating eight state-of-the-art LLMs reveals substantial performance degradation on high-complexity and cross-cultural commonsense tasks, exposing fundamental limitations in current reasoning-augmentation techniques. Our benchmark—the first skill-directed, multilingual commonsense reasoning evaluation suite—introduces a scalable, interpretable paradigm for assessing multilingual reasoning capabilities, enabling granular diagnostics of linguistic, cultural, and cognitive bottlenecks in LLMs.
📝 Abstract
Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a **M**ultilingual and Scalable Benchmark for **S**kill-based **Co**mmonsense **Re**asoning (**mSCoRe**). Our benchmark incorporates three key components designed to systematically evaluate LLMs' reasoning capabilities: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework that allows task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that **mSCoRe** remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide a detailed analysis of the models' reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.