X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

📅 2025-05-06
🤖 AI Summary
This work investigates the transferability of reasoning capabilities across modalities and domains. We propose a two-stage post-training paradigm that relies solely on general-purpose textual data: (1) supervised fine-tuning on distilled long chain-of-thought data, followed by (2) reinforcement learning with verifiable rewards, requiring no multimodal or domain-specific annotations. To our knowledge, this is the first empirical demonstration that purely textual post-training suffices to endow vision-language models with robust, generalizable reasoning abilities. Leveraging this insight, we instantiate X-Reasoner-Med, a medical-specialized variant of the resulting X-Reasoner model. Evaluated on both general and medical benchmarks, multimodal and text-only alike, X-Reasoner-Med consistently outperforms state-of-the-art methods, setting new records across multiple medical evaluation suites. Our approach establishes a lightweight, efficient, and scalable paradigm for cross-domain reasoning, significantly reducing reliance on costly, domain-specific labeled data.

📝 Abstract
Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet, most existing open-source research concentrates on training text-only reasoning models, with evaluations limited to mainly mathematical and general-domain tasks. Therefore, it remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: General-domain text-based post-training can enable such strong generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chain-of-thoughts, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). Additionally, we find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves new state of the art on numerous text-only and multimodal medical benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Extending reasoning capabilities beyond text input and general domains
Exploring generalizable reasoning across modalities and domains
Enhancing specialized domain performance via domain-specific text training
Innovation

Methods, ideas, or system contributions that make the work stand out.

General-domain text post-training enables cross-modal reasoning
Two-stage approach: fine-tuning plus reinforcement learning
Domain-specific text boosts specialized reasoning performance
Qianchu Liu, Microsoft Research (Natural Language Processing)
Sheng Zhang, Microsoft Research
Guanghui Qin, Microsoft (Machine Learning, Healthcare)
Timothy Ossowski, Microsoft Research
Yu Gu, Microsoft Research
Ying Jin, Microsoft Research
Sid Kiblawi, Microsoft (Computational Biology, Machine Learning, NLP)
Sam Preston, Microsoft Research
Mu Wei, Microsoft Research
Paul Vozila, Microsoft Research
Tristan Naumann, Principal Researcher, Microsoft Research Health Futures (Artificial Intelligence, Machine Learning, Natural Language Processing, Clinical Inference)
H. Poon, Microsoft Research