MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses sparse rewards, vanishing policy gradients, and training stagnation that arise in medical visual localization when reinforcement learning is driven by fixed IoU-based reward mechanisms, particularly for small or ambiguous lesions. To overcome these limitations, the authors propose a performance-aware curriculum reward scheduling framework that dynamically tightens reward thresholds from lenient to stringent criteria without introducing additional networks or gradient pathways. Using a sliding-window performance tracker and multi-condition update rules, the method integrates dynamic curriculum learning with adaptive reward shaping within the Group Relative Policy Optimization (GRPO) framework. Evaluated on three medical localization benchmarks, the approach significantly outperforms the GRPO baseline, enhancing both localization accuracy and training stability, and offers a lightweight, generalizable solution.
📝 Abstract
Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at https://github.com/MembrAI/MedLoc-R1.
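The scheduling idea in the abstract, a sliding-window performance tracker plus a multi-condition update rule that tightens an IoU reward threshold as the policy matures, can be illustrated with a minimal sketch. This is not the paper's implementation; the class name, threshold schedule, window size, dwell-time condition, and promotion rate below are all illustrative assumptions.

```python
from collections import deque


class CurriculumRewardScheduler:
    """Hypothetical sketch: the IoU threshold gating the localization
    reward is promoted from lenient to stringent values once a sliding
    window of recent rollouts indicates the policy is ready.
    All constants here are assumptions, not the paper's settings."""

    def __init__(self, thresholds=(0.1, 0.3, 0.5, 0.7),
                 window=256, promote_rate=0.6, min_steps=500):
        self.thresholds = thresholds        # lenient -> stringent IoU gates
        self.stage = 0
        self.window = deque(maxlen=window)  # sliding-window performance tracker
        self.promote_rate = promote_rate    # success rate required to advance
        self.min_steps = min_steps          # minimum dwell time per stage
        self.steps_in_stage = 0

    @property
    def threshold(self):
        return self.thresholds[self.stage]

    def reward(self, iou):
        """Binary localization reward under the current curriculum threshold."""
        success = iou >= self.threshold
        self.window.append(success)
        self.steps_in_stage += 1
        self._maybe_promote()
        return 1.0 if success else 0.0

    def _maybe_promote(self):
        # Multi-condition update rule: not at the final stage, a full
        # window observed, enough dwell time in the stage, and the
        # windowed success rate above the promotion criterion.
        if (self.stage < len(self.thresholds) - 1
                and len(self.window) == self.window.maxlen
                and self.steps_in_stage >= self.min_steps
                and sum(self.window) / len(self.window) >= self.promote_rate):
            self.stage += 1
            self.steps_in_stage = 0
            self.window.clear()
```

Because the schedule only remaps rollout rewards before the group-relative advantage is computed, it leaves GRPO's update untouched, which is consistent with the claim of adding no auxiliary networks or gradient paths.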
Problem

Research questions and friction points this paper is trying to address.

medical visual grounding
reward sparsity
reinforcement learning
IoU-based reward
policy gradient vanishing
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward scheduling
medical visual grounding
GRPO
performance-aware
reinforcement learning
Guangjing Yang
Beijing University of Posts and Telecommunications
Ziyuan Qin
Emory University
Multi-modality Models, Large Language Models, Medical Image Analysis
Chaoran Zhang
Beijing University of Posts and Telecommunications
Chenlin Du
Peking University
Biomedical Engineering, Deep Learning, Digital Dentistry
Jinlin Wang
DeepWisdom
Computer Vision, Multi-Agent Systems, Large Language Models, Large Vision-Language Models
Wanran Sun
Beijing University of Posts and Telecommunications
Zhenyu Zhang
Beijing University of Posts and Telecommunications
Bing Ji
Shandong University
Qicheng Lao
Beijing University of Posts and Telecommunications