MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical image grounding (MIG) requires precise localization of image regions conditioned on textual descriptions while modeling the spatial relationships among those regions, yet existing vision-language models (VLMs) rely on costly, scarce chain-of-thought (CoT) annotations for supervised fine-tuning. To address this, we propose an end-to-end optimization framework that eliminates the need for CoT supervision. Our method introduces a spatial-semantic joint reward mechanism and trains with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm recently shown to elicit reasoning without CoT annotations, alongside a Chain-of-Box reasoning template that explicitly encodes spatial logic among bounding boxes and enforces alignment with textual semantics. Evaluated on three benchmarks (MS-CXR, ChestX-ray8, and M3D-RefSeg), our approach achieves state-of-the-art performance. Ablation studies confirm the contributions of the spatial reward, the semantic reward, and the Chain-of-Box template. This work presents a CoT-free VLM framework for joint spatial-semantic optimization in medical grounding, substantially reducing annotation dependency while improving robustness and interpretability.

📝 Abstract
Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models not only to perceive regions but also to deduce the spatial relationships among them. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to acquire. Recently, DeepSeek-R1 demonstrated that Large Language Models (LLMs) can acquire reasoning abilities through Group Relative Policy Optimization (GRPO) without requiring CoT annotations. In this paper, we adapt the GRPO reinforcement learning framework to VLMs for Medical Image Grounding. We propose Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations. Specifically, we introduce Spatial-Semantic Rewards, which combine a spatial accuracy reward and a semantic consistency reward to provide nuanced feedback for both spatially positive and negative completions. Additionally, we propose the Chain-of-Box template, which integrates visual information of referring bounding boxes into the <think> reasoning process, enabling the model to explicitly reason about spatial regions during intermediate steps. Experiments on three datasets (MS-CXR, ChestX-ray8, and M3D-RefSeg) demonstrate that our method achieves state-of-the-art performance in Medical Image Grounding. Ablation studies further validate the effectiveness of each component in our approach. Code, checkpoints, and datasets are available at https://github.com/bio-mlhui/MedGround-R1.
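The combined reward described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the spatial accuracy reward is shown as intersection-over-union (IoU) between a predicted and a reference bounding box, `semantic_score` is a hypothetical stand-in for the semantic consistency reward, and the linear mix with weight `alpha` is an assumption.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def spatial_semantic_reward(pred_box, gold_box, semantic_score, alpha=0.5):
    """Illustrative mix of a spatial accuracy term (IoU) and a semantic
    consistency term; `alpha` and the linear combination are assumptions."""
    return alpha * iou(pred_box, gold_box) + (1.0 - alpha) * semantic_score
```

A reward of this shape gives graded feedback even for completions whose box only partially overlaps the reference, which is what distinguishes it from a binary hit-or-miss signal.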
Problem

Research questions and friction points this paper is trying to address.

Localizing medical image regions via textual descriptions
Reducing reliance on expensive Chain-of-Thought annotations
Improving spatial-semantic reasoning in Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-Semantic Rewarded GRPO for VLMs
Chain-of-Box template integrates visual reasoning
Combines spatial accuracy and semantic rewards
👥 Authors
Huihui Xu
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Yuanpeng Nie
Department of Nephrology, The Seventh Affiliated Hospital, Sun Yat-sen University, Shenzhen, China
Hualiang Wang
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Ying Chen
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Wei Li
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Junzhi Ning
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Lihao Liu
Amazon
Research interests: LLM-based Agent, Healthcare AI
Hongqiu Wang
Hong Kong University of Science and Technology (Guangzhou)
Research interests: AI for healthcare, Label-efficient learning, Multi-modal learning, Fairness, MLLM
Lei Zhu
The Hong Kong University of Science and Technology, Hong Kong SAR, China; The Hong Kong University of Science and Technology (Guangzhou), China
Jiyao Liu
Shanghai Artificial Intelligence Laboratory, Shanghai, China; Fudan University, Shanghai, China
Xiaomeng Li
Assistant Professor, The Hong Kong University of Science and Technology
Research interests: Medical Image Analysis, AI in Healthcare, Deep Learning
Junjun He
Shanghai Jiao Tong University