🤖 AI Summary
To address poor interpretability and weak clinical alignment in medical multimodal reasoning, this paper proposes MedE², a two-stage post-training framework. In Stage I, structured textual reasoning chains from 2,000 curated examples are used to elicit logical reasoning capabilities. In Stage II, 1,500 high-quality multimodal medical cases guide fine-tuning via reasoning-path supervision and a novel multimodal preference alignment loss, aligning model behavior with clinical knowledge and decision-making practices. This work introduces the first "elicitation–enhancement" paradigm for medical multimodal reasoning and establishes the first preference alignment mechanism tailored to this domain. Experiments demonstrate that MedE² delivers significant and robust improvements over strong baselines on multiple medical multimodal benchmarks, across varying model scales and inference-time scaling configurations, while enhancing both diagnostic accuracy and clinical plausibility.
📝 Abstract
Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose *MedE²*, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of *MedE²* in improving the reasoning performance of medical multimodal models. Notably, models trained with *MedE²* consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.