Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning

📅 2026-03-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses sparse credit assignment in multi-step visual reasoning: relying solely on terminal rewards weakens the link between visual evidence and intermediate reasoning steps, leading to unstable optimization and visual hallucinations. To mitigate this, the authors propose Differential Feedback, a mechanism that automatically repairs erroneous reasoning trajectories and derives token- or step-level supervision masks from the repairs, precisely marking the positions that require correction. This achieves process-level visual alignment without human-annotated fine-grained supervision. The method integrates seamlessly into GRPO-style reinforcement learning frameworks and is, per the authors, the first to enable process-level multimodal supervision with minimal human annotation, significantly improving consistency between reasoning and visual grounding. Evaluated on benchmarks such as MMStar and MathVista, it yields an average performance gain of 3% over baselines under identical compute budgets.
📝 Abstract
Vision-language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotations, our method enables process-level visual alignment and can be seamlessly integrated into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost solution for accurate vision-reasoning process alignment.
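The mechanism the abstract describes can be illustrated with a minimal sketch: compare an original rollout against its repaired trajectory to build a token-level "difference" mask, then use that mask to reweight a GRPO-style group-relative advantage. This is not the authors' code; the function names, the position-wise diff, and the `beta` up-weighting factor are illustrative assumptions about how such a scheme could be wired up.

```python
# Hedged sketch (not the paper's implementation): token-level difference
# masks from a repaired trajectory reweighting a GRPO-style advantage.
# All names (difference_mask, grpo_token_weights, beta) are assumptions.

def difference_mask(original_tokens, repaired_tokens):
    """Mark positions where the repaired trajectory differs from the
    original rollout; these are the steps presumed to need correction."""
    n = min(len(original_tokens), len(repaired_tokens))
    mask = [1.0 if original_tokens[i] != repaired_tokens[i] else 0.0
            for i in range(n)]
    # Tokens beyond the shared prefix length are treated as corrected.
    mask += [1.0] * (len(original_tokens) - n)
    return mask

def grpo_token_weights(group_rewards, masks, beta=1.0):
    """Group-relative advantage (reward z-scored within the sampled
    group, as in GRPO), with masked positions up-weighted by (1 + beta)."""
    mean_r = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean_r) ** 2 for r in group_rewards) / len(group_rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for constant rewards
    weights = []
    for r, mask in zip(group_rewards, masks):
        adv = (r - mean_r) / std  # same scalar advantage for the sequence
        weights.append([adv * (1.0 + beta * m) for m in mask])
    return weights
```

In an actual trainer these per-token weights would multiply the policy-gradient log-probability terms, concentrating the learning signal on the positions the repair identified rather than spreading a single terminal reward uniformly over the sequence.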
Problem

Research questions and friction points this paper is trying to address.

vision-language models
reinforcement learning
credit assignment
visual hallucinations
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential Feedback
process-level supervision
vision-language models
GRPO
multimodal reasoning
🔎 Similar Papers
No similar papers found.
Feiding
Shenzhen International Graduate School, Tsinghua University, China
Yongkang Zhang
Shenzhen International Graduate School, Tsinghua University, China
Yuhao Liao
Shenzhen International Graduate School, Tsinghua University, China
Zijian Zeng
Shenzhen International Graduate School, Tsinghua University, China
Chunzheng Zhu
Shenzhen International Graduate School, Tsinghua University, China
Yaozong Zheng
Guangxi Normal University
Visual Tracking · Multimodal Tracking
Yafei Liu
Southwest Jiaotong University
Railway · Automatic Train Operation · Optimal Control · Model Predictive Control
Yeling Peng
Shenzhen International Graduate School, Tsinghua University, China
Youwei Wang
Shenzhen International Graduate School, Tsinghua University, China
Sibo Wang
The Chinese University of Hong Kong
Databases
Huiming Yang
Shenzhen International Graduate School, Tsinghua University, China
Linglin Liao
Shenzhen International Graduate School, Tsinghua University, China
Shunzhi Yang
Shenzhen International Graduate School, Tsinghua University, China