🤖 AI Summary
This work investigates the transferability of reasoning capabilities across modalities and domains. We propose a two-stage post-training paradigm that relies solely on general-purpose textual data: (1) distillation-based long-chain-of-thought supervised fine-tuning, followed by (2) reinforcement learning guided by verifiable logical rewards—requiring no multimodal or domain-specific annotations. To our knowledge, this is the first empirical demonstration that pure textual post-training suffices to endow vision-language models with robust, generalizable reasoning abilities. Leveraging this insight, we instantiate X-Reasoner-Med, a medical-specialized variant. Evaluated on both general and medical multimodal and unimodal benchmarks, X-Reasoner-Med consistently outperforms state-of-the-art methods; it sets new records across multiple medical AI evaluation suites. Our approach establishes a lightweight, efficient, and scalable paradigm for cross-domain reasoning, significantly reducing reliance on costly, domain-specific labeled data.
📝 Abstract
Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet, most existing open-source research concentrates on training text-only reasoning models, with evaluations limited mainly to mathematical and general-domain tasks. It therefore remains unclear how to effectively extend reasoning capabilities beyond text inputs and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: general-domain text-based post-training can enable strong, generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chains-of-thought, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). Additionally, we find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves a new state of the art on numerous text-only and multimodal medical benchmarks.
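To make the two-stage recipe concrete, here is a minimal sketch in Python. The training internals are stubbed out (the paper does not publish this code); function names like `sft_stage` and `rlvr_stage` are illustrative, and the rule-based reward below is just one common way to implement "verifiable rewards" (exact match on a final `\boxed{...}` answer), assumed here rather than taken from the paper.

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based verifiable reward: 1.0 if the completion's final
    \\boxed{...} answer matches the gold answer exactly, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if matches and matches[-1].strip() == gold_answer.strip():
        return 1.0
    return 0.0

# Stage 1: supervised fine-tuning on distilled long chains-of-thought.
# The gradient step is a stand-in counter; a real run would update model weights.
def sft_stage(model: dict, distilled_cot_corpus: list[tuple[str, str]]) -> dict:
    for prompt, long_cot in distilled_cot_corpus:
        model["sft_steps"] = model.get("sft_steps", 0) + 1  # stand-in for a gradient step
    return model

# Stage 2: RL with verifiable rewards on general-domain *text-only* problems.
# The policy update is likewise stubbed; only the reward accounting is real.
def rlvr_stage(model: dict, text_problems: list[tuple[str, str]], sample_fn) -> dict:
    for prompt, gold in text_problems:
        completion = sample_fn(model, prompt)          # sample a reasoning trace
        r = verifiable_reward(completion, gold)        # check the final answer
        model["reward_sum"] = model.get("reward_sum", 0.0) + r  # stand-in for a policy-gradient update
    return model
```

The key design point the sketch highlights is that both stages consume only text: stage 1 needs (prompt, chain-of-thought) pairs, and stage 2 needs (prompt, checkable answer) pairs, with no image or domain-specific annotations anywhere in the loop.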