Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning

📅 2026-03-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses sparse credit assignment in multi-step visual reasoning: relying solely on terminal rewards weakens the link between visual evidence and intermediate reasoning steps, leading to unstable optimization and visual hallucinations. To mitigate this, the authors propose Differential Feedback, a mechanism that automatically repairs erroneous reasoning trajectories and derives token- or step-level supervision masks from the repairs, precisely marking the positions that require correction. This achieves process-level visual alignment without human-annotated fine-grained supervision. The method integrates seamlessly into GRPO-style reinforcement learning frameworks and is, per the authors, the first to enable process-level multimodal supervision with minimal human annotation, significantly improving consistency between reasoning and visual grounding. Evaluated on benchmarks such as MMStar and MathVista, it yields an average performance gain of 3% over baselines under identical compute budgets.
📝 Abstract
Vision-language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotations, our method enables process-level visual alignment and can be seamlessly integrated into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost solution for accurate vision-reasoning process alignment.
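The mechanism the abstract describes can be illustrated with a minimal sketch: compare an original rollout against its repaired trajectory to build a token-level "difference" mask, then use that mask to reweight a GRPO-style group-relative advantage. This is not the authors' code; the function names, the position-wise diff, and the `beta` up-weighting factor are illustrative assumptions about how such a scheme could be wired up.

```python
# Hedged sketch (not the paper's implementation): token-level difference
# masks from a repaired trajectory reweighting a GRPO-style advantage.
# All names (difference_mask, grpo_token_weights, beta) are assumptions.

def difference_mask(original_tokens, repaired_tokens):
    """Mark positions where the repaired trajectory differs from the
    original rollout; these are the steps presumed to need correction."""
    n = min(len(original_tokens), len(repaired_tokens))
    mask = [1.0 if original_tokens[i] != repaired_tokens[i] else 0.0
            for i in range(n)]
    # Tokens beyond the shared prefix length are treated as corrected.
    mask += [1.0] * (len(original_tokens) - n)
    return mask

def grpo_token_weights(group_rewards, masks, beta=1.0):
    """Group-relative advantage (reward z-scored within the sampled
    group, as in GRPO), with masked positions up-weighted by (1 + beta)."""
    mean_r = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean_r) ** 2 for r in group_rewards) / len(group_rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for constant rewards
    weights = []
    for r, mask in zip(group_rewards, masks):
        adv = (r - mean_r) / std  # same scalar advantage for the sequence
        weights.append([adv * (1.0 + beta * m) for m in mask])
    return weights
```

In an actual trainer these per-token weights would multiply the policy-gradient log-probability terms, concentrating the learning signal on the positions the repair identified rather than spreading a single terminal reward uniformly over the sequence.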
Problem

Research questions and friction points this paper is trying to address.

vision-language models
reinforcement learning
credit assignment
visual hallucinations
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential Feedback
process-level supervision
vision-language models
GRPO
multimodal reasoning
🔎 Similar Papers
No similar papers found.
Feiding
Shenzhen International Graduate School, Tsinghua University, China
Yongkang Zhang
Shenzhen International Graduate School, Tsinghua University, China
Yuhao Liao
Shenzhen International Graduate School, Tsinghua University, China
Zijian Zeng
Shenzhen International Graduate School, Tsinghua University, China
Chunzheng Zhu
Shenzhen International Graduate School, Tsinghua University, China
Yaozong Zheng
Guangxi Normal University
Visual Tracking · Multimodal Tracking
Yafei Liu
Southwest Jiaotong University
Railway · Automatic Train Operation · Optimal Control · Model Predictive Control
Yeling Peng
Shenzhen International Graduate School, Tsinghua University, China
Youwei Wang
Shenzhen International Graduate School, Tsinghua University, China
Sibo Wang
The Chinese University of Hong Kong
Databases
Huiming Yang
Shenzhen International Graduate School, Tsinghua University, China
Linglin Liao
Shenzhen International Graduate School, Tsinghua University, China
Shunzhi Yang
Shenzhen International Graduate School, Tsinghua University, China