Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D medical vision-language models lack reliable clinical reasoning capabilities for abnormality diagnosis, often overfitting to superficial patterns in reports and exhibiting poor interpretability. To address these limitations, this work proposes Med3D-R1, a framework trained through supervised fine-tuning followed by reinforcement learning. The approach introduces a residual alignment mechanism to narrow the modality gap between 3D visual features and text, an abnormality-aware re-weighting strategy to reduce structural bias in reports, and a consistency-based reward to explicitly guide step-by-step clinical reasoning. Evaluated on two 3D diagnostic benchmarks, CT-RATE and RAD-ChestCT, the method achieves state-of-the-art accuracies of 41.92% and 44.99%, respectively.
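The summary names a residual alignment mechanism but does not spell it out. A minimal sketch of what such a mechanism typically looks like is below; the function name, shapes, and the assumption that the visual and text embedding dimensions match are all hypothetical, not details from the paper:

```python
import numpy as np

def residual_align(vision_feats, W_proj, b_proj):
    """Hypothetical residual alignment between 3D visual features and
    the text embedding space.

    The learned projection produces a *correction* that is added back to
    the input, so the alignment layer only has to model the residual
    modality gap rather than a full re-mapping.

    vision_feats: (N, d) pooled 3D CT features
    W_proj:       (d, d) learned projection matrix
    b_proj:       (d,)   learned bias
    """
    projected = vision_feats @ W_proj + b_proj
    # Residual connection: output = input + learned correction.
    return vision_feats + projected
```

With a zero-initialized projection this layer starts as the identity, which is one common motivation for residual designs: alignment training can only move features away from their pretrained representation gradually.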

📝 Abstract
Developing 3D vision-language models with robust clinical reasoning remains a challenge due to the inherent complexity of volumetric medical imaging, the tendency of models to overfit superficial report patterns, and the lack of interpretability-aware reward designs. In this paper, we propose Med3D-R1, a reinforcement learning framework with a two-stage training process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). During the SFT stage, we introduce a residual alignment mechanism to bridge the gap between high-dimensional 3D features and textual embeddings, and an abnormality re-weighting strategy to emphasize clinically informative tokens and reduce structural bias in reports. In the RL stage, we redesign the consistency reward to explicitly promote coherent, step-by-step diagnostic reasoning. We evaluate our method on medical multiple-choice visual question answering using two 3D diagnostic benchmarks, CT-RATE and RAD-ChestCT, where our model attains state-of-the-art accuracies of 41.92% on CT-RATE and 44.99% on RAD-ChestCT. These results indicate improved abnormality diagnosis and clinical reasoning and outperform prior methods on both benchmarks. Overall, our approach holds promise for enhancing real-world diagnostic workflows by enabling more reliable and transparent 3D medical vision-language systems.
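The abstract's abnormality re-weighting strategy is described only at a high level. One plausible reading is a per-token weighted cross-entropy over the report, where tokens flagged as clinically informative are up-weighted; the sketch below illustrates that idea only, and the function name, the `abnormal_mask` input, and the `alpha` weight are assumptions, not the paper's actual formulation:

```python
import numpy as np

def reweighted_token_loss(log_probs, target_ids, abnormal_mask, alpha=2.0):
    """Hypothetical abnormality re-weighting for report supervision.

    Up-weights the per-token negative log-likelihood of abnormality
    tokens so training emphasizes clinical findings rather than the
    boilerplate structure shared by most reports.

    log_probs:     (T, V) log-softmax over the vocabulary per position
    target_ids:    (T,)   gold report token ids
    abnormal_mask: (T,)   1.0 where the token describes an abnormality
    alpha:         extra weight given to abnormality tokens (alpha >= 1)
    """
    # Per-token negative log-likelihood of the gold tokens.
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    # Weight 1 for ordinary tokens, alpha for abnormality tokens.
    weights = 1.0 + (alpha - 1.0) * abnormal_mask
    return float((weights * nll).sum() / weights.sum())
```

With `alpha = 1` this reduces to the standard mean token cross-entropy, so the re-weighting can be seen as a smooth interpolation away from the unweighted objective.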
Problem

Research questions and friction points this paper is trying to address.

clinical reasoning
3D medical vision-language models
abnormality diagnosis
interpretability
volumetric medical imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D vision-language model
clinical reasoning
reinforcement learning
residual alignment
abnormality re-weighting