🤖 AI Summary
This work addresses 3D medical image segmentation from free-form clinical queries, where the target is often referred to implicitly and must be inferred through reasoning. It proposes MedVol-R1, a framework that introduces, for the first time in volumetric segmentation, a reward-driven explicit evidence anchoring mechanism. The approach decouples evidence localization from segmentation: a large vision-language model identifies a verifiable 2D evidence anchor (a key axial slice and 2D bounding boxes), which a frozen MedSAM2 module then propagates into a consistent 3D mask. Training combines cold-start supervised fine-tuning with the GRPO algorithm, using a multi-component reward that encourages informative evidence selection, precise 2D localization, and cross-slice consistency, without requiring chain-of-thought annotations. Evaluated on CT-ORG, AbdomenCT-1K, and KiTS23, MedVol-R1 achieves state-of-the-art performance and consistently outperforms strong baselines, with reinforcement learning yielding clear gains over purely supervised training. The explicit evidence anchor is intended to make the segmentation decision interpretable and to improve generalization across differently worded queries.
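The GRPO step can be illustrated with its central idea: several responses are sampled for the same query, and each response's reward is normalized against the others in its group, so no separate value network is needed. The following is a minimal sketch of that normalization only, not the authors' training code. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one sampled group into advantages.

    rewards: scalar rewards, one per response sampled for the same query.
    eps: small constant (assumed here) that avoids division by zero when
         every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one clinical query, scored by a reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged. A group in which every response scores the same yields zero advantages and therefore no learning signal.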
📝 Abstract
Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions. In this paper, we present MedVol-R1, a reinforcement learning-based framework for VRS that explicitly decouples evidence grounding from volumetric delineation: a large vision-language model (LVLM) grounds clinical reasoning to a verifiable 2D evidence anchor (a key axial slice and 2D bounding boxes), which is then propagated into a coherent 3D mask by a frozen MedSAM2 module. We train MedVol-R1 with cold-start supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), guided by a multi-component reward that encourages informative evidence selection, accurate 2D spatial grounding, and cross-slice volumetric coherence, without requiring costly chain-of-thought annotations. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark demonstrate that MedVol-R1 consistently outperforms strong baselines and achieves state-of-the-art performance, with reinforcement learning providing clear gains over pure supervised fine-tuning.
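To make the multi-component reward concrete, the sketch below combines three terms that mirror the goals named in the abstract: whether the chosen key slice contains the target, how well the predicted 2D box overlaps a reference box on that slice, and how well the propagated 3D mask matches the reference volume. The abstract does not give the actual reward terms or weights, so everything here is an assumption for illustration: the function names, the use of box IoU and Dice, the equal weights, the single-box simplification, and the representation of masks as sets of `(z, y, x)` voxel coordinates.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def dice(pred_voxels, gt_voxels):
    """Dice overlap of two masks stored as sets of (z, y, x) coordinates."""
    if not pred_voxels and not gt_voxels:
        return 1.0
    return 2 * len(pred_voxels & gt_voxels) / (len(pred_voxels) + len(gt_voxels))


def anchor_reward(pred_slice, pred_box, pred_mask, gt_boxes_by_slice, gt_mask,
                  weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical reward: weighted sum of slice, box, and volume terms.

    gt_boxes_by_slice maps each axial slice index that contains the target
    to a reference box on that slice. The equal weights are an assumption.
    """
    gt_box = gt_boxes_by_slice.get(pred_slice)
    r_slice = 1.0 if gt_box is not None else 0.0                    # evidence selection
    r_box = box_iou(pred_box, gt_box) if gt_box is not None else 0.0  # 2D grounding
    r_vol = dice(pred_mask, gt_mask)                                # volumetric agreement
    w_slice, w_box, w_vol = weights
    return w_slice * r_slice + w_box * r_box + w_vol * r_vol


# Toy example: the target spans slices 10 and 11.
gt_boxes = {10: (0, 0, 2, 2), 11: (0, 0, 2, 2)}
gt_mask = {(10, 0, 0), (10, 0, 1), (11, 0, 0), (11, 0, 1)}
reward = anchor_reward(10, (0, 0, 2, 2), gt_mask, gt_boxes, gt_mask)
```

In the pipeline described above, `pred_mask` would come from propagating the predicted anchor through the frozen MedSAM2 module, so the volume term rewards anchors that lead to a coherent 3D mask rather than ones that only look correct on a single slice.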