MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses sparse rewards, vanishing policy gradients, and training stagnation that arise in medical visual localization when reinforcement learning is driven by fixed IoU-based reward mechanisms, particularly for small or ambiguous lesions. To overcome these limitations, the authors propose a performance-aware curriculum reward scheduling framework that dynamically tightens reward thresholds from lenient to stringent criteria without introducing additional networks or gradient pathways. Using a sliding-window performance tracker and multi-condition update rules, the method integrates dynamic curriculum learning with adaptive reward shaping within the Group Relative Policy Optimization (GRPO) framework. Evaluated on three medical localization benchmarks, the approach significantly outperforms the GRPO baseline, enhancing both localization accuracy and training stability, and offers a lightweight, generalizable solution.
📝 Abstract
Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at https://github.com/MembrAI/MedLoc-R1.
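The scheduling idea in the abstract, a sliding-window performance tracker plus a multi-condition update rule that tightens an IoU reward threshold as the policy matures, can be illustrated with a minimal sketch. This is not the paper's implementation; the class name, threshold schedule, window size, dwell-time condition, and promotion rate below are all illustrative assumptions.

```python
from collections import deque


class CurriculumRewardScheduler:
    """Hypothetical sketch: the IoU threshold gating the localization
    reward is promoted from lenient to stringent values once a sliding
    window of recent rollouts indicates the policy is ready.
    All constants here are assumptions, not the paper's settings."""

    def __init__(self, thresholds=(0.1, 0.3, 0.5, 0.7),
                 window=256, promote_rate=0.6, min_steps=500):
        self.thresholds = thresholds        # lenient -> stringent IoU gates
        self.stage = 0
        self.window = deque(maxlen=window)  # sliding-window performance tracker
        self.promote_rate = promote_rate    # success rate required to advance
        self.min_steps = min_steps          # minimum dwell time per stage
        self.steps_in_stage = 0

    @property
    def threshold(self):
        return self.thresholds[self.stage]

    def reward(self, iou):
        """Binary localization reward under the current curriculum threshold."""
        success = iou >= self.threshold
        self.window.append(success)
        self.steps_in_stage += 1
        self._maybe_promote()
        return 1.0 if success else 0.0

    def _maybe_promote(self):
        # Multi-condition update rule: not at the final stage, a full
        # window observed, enough dwell time in the stage, and the
        # windowed success rate above the promotion criterion.
        if (self.stage < len(self.thresholds) - 1
                and len(self.window) == self.window.maxlen
                and self.steps_in_stage >= self.min_steps
                and sum(self.window) / len(self.window) >= self.promote_rate):
            self.stage += 1
            self.steps_in_stage = 0
            self.window.clear()
```

Because the schedule only remaps rollout rewards before the group-relative advantage is computed, it leaves GRPO's update untouched, which is consistent with the claim of adding no auxiliary networks or gradient paths.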
Problem

Research questions and friction points this paper is trying to address.

medical visual grounding
reward sparsity
reinforcement learning
IoU-based reward
policy gradient vanishing
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward scheduling
medical visual grounding
GRPO
performance-aware
reinforcement learning
Guangjing Yang
Beijing University of Posts and Telecommunications
Ziyuan Qin
Emory University
Multi-modality Models, Large Language Models, Medical Image Analysis
Chaoran Zhang
Beijing University of Posts and Telecommunications
Chenlin Du
Peking University
Biomedical Engineering, Deep Learning, Digital Dentistry
Jinlin Wang
DeepWisdom
Computer Vision, Multi-Agent Systems, Large Language Models, Large Vision-Language Models
Wanran Sun
Beijing University of Posts and Telecommunications
Zhenyu Zhang
Beijing University of Posts and Telecommunications
Bing Ji
Shandong University
Qicheng Lao
Beijing University of Posts and Telecommunications