Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient clinical alignment of multimodal large language models (MLLMs) in medical visual question answering (VQA), this paper studies GRPO (Group Relative Policy Optimization)-based reinforcement learning (RL) fine-tuning as a paradigm for clinical semantic fidelity and reasoning rigor. The work systematically analyzes the failure mechanisms of standard RL in medical VQA and integrates three components: medical knowledge–guided reward modeling, a semantic alignment loss, and length-adaptive reward shaping. It jointly examines four critical dimensions: foundation model initialization, clinical semantic alignment, response-length optimization, and bias mitigation. Evaluated across multiple medical VQA benchmarks, GRPO-based tuning outperforms supervised fine-tuning (SFT) by +4.2% in accuracy and +27% in long-chain reasoning quality. Generated answers adhere more closely to clinical guidelines and evidence-based reasoning principles, substantially overcoming key limitations of generic RL methods in domain-specific long-horizon reasoning and bias control.
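The summary hinges on GRPO's defining step: instead of a learned value critic, each sampled response's reward is normalized against the other responses in its group. A minimal sketch of that normalization (the function name is my own; the paper's exact reward terms are not shown here):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: for a group of rewards
    from responses sampled for the same prompt, center each reward on the
    group mean and scale by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: identical rewards give std 0
    return [(r - mean) / std for r in rewards]
```

Responses scoring above their group's mean get positive advantage and are reinforced; the full objective additionally clips the policy ratio and adds a KL penalty, which this sketch omits.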

📝 Abstract
Recently, reinforcement learning (RL)-based tuning has shifted the trajectory of Multimodal Large Language Models (MLLMs), particularly following the introduction of Group Relative Policy Optimization (GRPO). However, directly applying it to medical tasks remains challenging for achieving clinically grounded model behavior. Motivated by the need to align model responses with clinical expectations, we investigate four critical dimensions that affect the effectiveness of RL-based tuning in medical visual question answering (VQA): base model initialization strategy, the role of medical semantic alignment, the impact of length-based rewards on long-chain reasoning, and the influence of bias. We conduct extensive experiments to analyze these factors for medical MLLMs, providing new insights into how models are domain-specifically fine-tuned. Our results also demonstrate that GRPO-based RL tuning consistently outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality.
Problem

Research questions and friction points this paper is trying to address.

Enhancing RL-based tuning for medical VQA in vision-language models
Aligning model responses with clinical expectations in medical tasks
Analyzing key factors affecting RL tuning effectiveness in medical MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses GRPO for reinforcement learning fine-tuning
Focuses on medical semantic alignment in VQA
Analyzes length-based rewards for reasoning quality
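The length-based reward analysis above can be illustrated with a toy shaping function. This is a hypothetical sketch, not the paper's actual formula: the task reward is scaled down as the response length drifts outside a tolerance band around a target, discouraging both truncated and padded reasoning chains (the `target` and `width` parameters are my own assumptions):

```python
def length_shaped_reward(base_reward, n_tokens, target=256, width=128):
    """Hypothetical length-adaptive shaping: leave rewards untouched inside
    [target - width, target + width], then decay linearly with the relative
    drift beyond that band, floored at zero."""
    drift = max(0.0, abs(n_tokens - target) - width) / target
    return base_reward * max(0.0, 1.0 - drift)
```

With these defaults a 256-token answer keeps its full reward, while a 512-token answer keeps only half, so the policy is nudged toward reasoning chains of clinically useful length rather than maximal verbosity.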