MedVLThinker: Simple Baselines for Multimodal Medical Reasoning

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical foundation models lack an open, reproducible recipe for developing robust reasoning capabilities. Method: The authors present MedVLThinker, a suite of simple yet strong baselines for multimodal medical reasoning built on the Qwen2.5-VL series. The recipe combines difficulty-aware curation of both text-only and image-text medical data with two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final-answer correctness. Contribution/Results: RLVR consistently outperforms SFT, and, counter-intuitively, curated text-only reasoning data provides a larger performance boost than multimodal image-text data. The 7B model sets a new open-source state of the art across six medical QA benchmarks; the 32B model matches GPT-4o. All curated data, model weights, and code are released to support reproducibility.

📝 Abstract
Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
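The core idea behind RLVR is that the reward needs no learned reward model: for multiple-choice medical QA, correctness of the final answer is directly verifiable. A minimal sketch of such a reward function, assuming a hypothetical output convention where the model ends its chain of thought with a line like `Answer: B` (the paper's exact answer-extraction format is not specified here):

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Binary reward based solely on final-answer correctness.

    Assumed convention: the model terminates its reasoning with a line
    such as 'Answer: B'. We extract that choice letter and compare it
    to the gold multiple-choice answer; anything unparseable earns 0.
    """
    match = re.search(r"Answer:\s*([A-E])\b", model_output, re.IGNORECASE)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).upper() == gold_answer.upper() else 0.0
```

In an RLVR loop, this scalar reward would feed a policy-gradient update (e.g. PPO or GRPO) over sampled reasoning traces, so the model is optimized end-to-end for answer correctness rather than for imitating any fixed trace.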
Problem

Research questions and friction points this paper is trying to address.

Lack of open recipes for medical reasoning models
Need for systematic data curation in medical AI
Performance gap between text-only and multimodal training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic data curation for medical reasoning
Reinforcement Learning with Verifiable Rewards
Training on text-only data boosts performance
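One common way to realize difficulty-aware curation, sketched here as an assumption since the paper's exact filtering criterion is not given in this summary, is to sample a reference model several times per question and keep only items in a medium-difficulty band: questions the model always solves teach nothing, and questions it never solves yield no learning signal under a correctness-only reward.

```python
def keep_by_difficulty(pass_count: int, num_samples: int,
                       low: float = 0.1, high: float = 0.9) -> bool:
    """Hypothetical pass-rate filter for difficulty-aware curation.

    pass_count: how many of num_samples sampled answers were correct.
    Keep the item only if the empirical pass rate lies in (low, high),
    i.e. the question is neither trivially easy nor hopelessly hard.
    """
    rate = pass_count / num_samples
    return low <= rate <= high
```

The band thresholds (`low`, `high`) are illustrative; in practice they would be tuned per dataset and reference model.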