🤖 AI Summary
Medical foundation models for chest X-ray (CXR) interpretation lack transparent, spatially grounded reasoning processes, hindering clinical adoption. To address this, we propose DeepMedix-R1—the first generative CXR foundation model supporting localized, interpretable reasoning. Methodologically, we introduce an end-to-end sequential training paradigm integrating instruction tuning, cold-starting with high-quality synthetic reasoning data, and online reinforcement learning to jointly model answer generation and image-localized reasoning paths. We further design Report Arena, an automated evaluation framework enabling multi-dimensional assessment of interpretability. Experiments demonstrate that DeepMedix-R1 significantly outperforms baselines—including LLaVA-Rad and MedGemma—on radiology report generation and visual question answering. Crucially, expert radiologist evaluations confirm that its stepwise reasoning exhibits substantially higher clinical acceptability than that of Qwen2.5-VL-7B.
📝 Abstract
Medical foundation models (FMs) have shown tremendous promise amid the rapid advancements in artificial intelligence (AI) technologies. However, current medical FMs typically generate answers in a black-box manner, lacking transparent reasoning processes and locally grounded interpretability, which hinders their practical clinical deployment. To this end, we introduce DeepMedix-R1, a holistic medical FM for chest X-ray (CXR) interpretation. It leverages a sequential training pipeline: the model is initially fine-tuned on curated CXR instruction data to equip it with fundamental CXR interpretation capabilities, then exposed to high-quality synthetic reasoning samples to enable cold-start reasoning, and finally refined via online reinforcement learning to enhance both grounded reasoning quality and generation performance. For each query, the model thus produces both an answer and reasoning steps tied to local regions of the image. Quantitative evaluation demonstrates substantial improvements on report generation (e.g., 14.54% and 31.32% over LLaVA-Rad and MedGemma, respectively) and visual question answering (e.g., 57.75% and 23.06% over MedGemma and CheXagent, respectively) tasks. To facilitate robust assessment, we propose Report Arena, a benchmarking framework that uses advanced language models to evaluate answer quality, further highlighting the superiority of DeepMedix-R1. Expert review of generated reasoning steps reveals greater interpretability and clinical plausibility compared to the established Qwen2.5-VL-7B model (0.7416 vs. 0.2584 overall preference). Collectively, our work advances medical FM development toward holistic, transparent, and clinically actionable modeling for CXR interpretation.