Do Language Models Reason Across Languages?

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether language models can effectively integrate cross-lingual information for two-hop reasoning in multilingual settings. To this end, the authors construct a multilingual two-hop question answering benchmark and propose SUBQ, a three-stage prompting method that explicitly guides models to decompose the task into sub-question generation, answer retrieval, and compositional reasoning. Experimental results reveal that current models lack faithful step-by-step reasoning mechanisms, with approximately 18% of errors being composition failures, where both sub-questions are answered correctly but the final two-hop question is not. The proposed SUBQ approach substantially improves accuracy from 10.1% to 66.5%, demonstrating the efficacy of structured prompting in enhancing multilingual multi-hop reasoning capabilities.

📝 Abstract
Real-world information sources are inherently multilingual, which naturally raises the question of whether language models can synthesize information across languages. In this paper, we introduce a simple two-hop question answering setting, where answering a question requires making inferences over two multilingual documents. We find that language models are more sensitive to language variation in answer-span documents than in those providing bridging information, despite both documents being equally important for answering a question. Under a step-by-step sub-question evaluation, we further show that in up to 33% of multilingual cases, models fail to infer the bridging information in the first step yet still answer the overall question correctly. This indicates that reasoning in language models, especially in multilingual settings, does not follow a faithful step-by-step decomposition. Subsequently, we show that the absence of reasoning decomposition leads to around 18% composition failure, where both sub-questions are answered correctly but the final two-hop question is not. To mitigate this, we propose a simple three-stage SUBQ prompting method that guides multi-step reasoning with sub-questions, boosting accuracy from 10.1% to 66.5%.
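The three stages the abstract describes (sub-question generation, sub-question answering, compositional reasoning) can be sketched as a minimal prompting pipeline. This is an illustration only: the prompt wording, the `[ANSWER1]` bridge placeholder, and the `ask_model` callable are assumptions, not the paper's actual SUBQ prompts.

```python
def subq_pipeline(question, doc_bridge, doc_answer, ask_model):
    """Three-stage SUBQ-style prompting sketch (prompt wording is assumed).

    ask_model: a callable taking a prompt string and returning the model's
    text reply, e.g. a wrapper around any chat-completion API.
    """
    # Stage 1: decompose the two-hop question into two sub-questions,
    # the second containing a placeholder for the bridging answer.
    sub_qs = ask_model(
        "Decompose the question into two sub-questions, one per line, "
        f"using [ANSWER1] as a placeholder in the second:\n{question}"
    ).splitlines()

    # Stage 2: answer each sub-question against its own document;
    # the first answer (the bridge entity) fills the placeholder.
    a1 = ask_model(
        f"Document:\n{doc_bridge}\nQuestion: {sub_qs[0]}\nAnswer:"
    )
    q2 = sub_qs[1].replace("[ANSWER1]", a1)
    a2 = ask_model(
        f"Document:\n{doc_answer}\nQuestion: {q2}\nAnswer:"
    )

    # Stage 3: compose the two sub-answers into the final answer.
    return ask_model(
        f"Question: {question}\nStep 1: {a1}\nStep 2: {a2}\nFinal answer:"
    )
```

Separating the stages makes each sub-answer explicit, which is what lets the method avoid the composition failures the paper reports, where both hops succeed individually but the single-shot answer is wrong.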
Problem

Research questions and friction points this paper is trying to address.

multilingual reasoning
cross-lingual inference
language models
multi-hop question answering
reasoning decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual reasoning
two-hop question answering
step-by-step decomposition
SUBQ prompting
composition failure
Yan Meng
Ph.D. student, Language Technology Lab, University of Amsterdam
Natural Language Processing · Machine Translation
Wafaa Mohammed
Language Technology Lab, University of Amsterdam
C. Monz
Language Technology Lab, University of Amsterdam