Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation

📅 2026-02-11
🤖 AI Summary
Existing reinforcement fine-tuning (RFT) methods struggle to achieve precise perception and structured reasoning simultaneously in medical vision tasks, limiting the deployment of large multimodal models in high-stakes clinical settings. This work proposes VRFT-Aug, a framework that introduces a synergistic mechanism coupling perception and reasoning enhancement into medical visual RFT. By integrating prior knowledge injection, perception-driven policy optimization, clinically informed reward shaping, and behavioral imitation, VRFT-Aug establishes a generalizable paradigm for training-strategy and reward design. Experiments show that the method significantly outperforms standard supervised fine-tuning and conventional RFT across multiple medical datasets, offering transferable insights for training robust, clinically aligned vision-language models.

📝 Abstract
While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.
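The abstract builds on rule-based reward schemes for RFT and adds medically informed reward shaping. The paper's reward code is not reproduced on this page; the sketch below is a minimal illustrative example of how such a rule-based reward is commonly composed, with a format term, an exact-match accuracy term, and a hypothetical clinical-keyword shaping term. All function names, tag conventions, and weights are assumptions, not the authors' implementation.

```python
import re

def format_reward(completion: str) -> float:
    # Rule-based format check: reward 1.0 only when the model wraps its
    # reasoning in <think> tags followed by a final <answer> tag.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    # Exact-match accuracy on the extracted answer span (case-insensitive).
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    pred = m.group(1).strip().lower() if m else ""
    return 1.0 if pred == gold.strip().lower() else 0.0

def shaped_reward(completion: str, gold: str, keywords: list[str],
                  w_fmt: float = 0.2, w_acc: float = 1.0,
                  w_clin: float = 0.3) -> float:
    # Hypothetical "clinically informed" shaping: partial credit when the
    # reasoning trace mentions expected clinical findings (keyword list
    # assumed to come from annotations; not the paper's actual mechanism).
    m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    think = m.group(1).lower() if m else ""
    clin = sum(k.lower() in think for k in keywords) / max(len(keywords), 1)
    return (w_fmt * format_reward(completion)
            + w_acc * accuracy_reward(completion, gold)
            + w_clin * clin)
```

In GRPO-style RFT pipelines, a scalar reward like this is computed per sampled completion and used to weight policy updates; the shaping term gives gradient signal even when the final answer is wrong, which is one plausible reading of "medically informed reward shaping."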
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Fine-Tuning
Medical Imaging
Visual Perception
Structured Reasoning
Cross-modal Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Reinforcement Fine-Tuning
Perception-Reasoning Augmentation
Medical Imaging
Reward Shaping
Cross-modal Learning
👥 Authors

Guangjing Yang (Beijing University of Posts and Telecommunications)
ZhangYuan Yu (Beijing University of Posts and Telecommunications)
Ziyuan Qin (Emory University): Multi-modality Models, Large Language Models, Medical Image Analysis
Xinyuan Song (Emory University): Statistics, Machine Learning
Huahui Yi (Sichuan University)
Qingbo Kang (West China Hospital): Deep Learning, Medical Image Analysis, Computer Vision
Jun Gao (Sichuan University)
Yiyue Li (Sichuan University)
Chenlin Du (Peking University): Biomedical Engineering, Deep Learning, Digital Dentistry
Qicheng Lao (Beijing University of Posts and Telecommunications)