🤖 AI Summary
This work addresses sparse credit assignment in multi-step visual reasoning: when training relies solely on terminal outcome rewards, the link between visual evidence and intermediate reasoning steps weakens, which often leads to unstable optimization and visual hallucinations. The authors propose Differential Feedback, a mechanism that repairs erroneous reasoning trajectories and automatically derives token- or step-level supervision masks marking the positions that require correction. This yields process-level visual alignment without costly human-annotated step-by-step supervision, and the method can be integrated into existing GRPO-style reinforcement learning frameworks. On multimodal reasoning benchmarks including MMMStar and MathVista, the method reports an average improvement of 3% under matched compute budgets.
📝 Abstract
Vision-language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotations, our method enables process-level visual alignment and can be seamlessly integrated into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost solution for accurate vision-reasoning process alignment.
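The abstract does not say how the supervision masks are computed or how they enter the training objective, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that a mask can be obtained by diffing an erroneous trajectory against its repaired version, and that the mask then reweights a GRPO-style group-relative advantage per token. The function names (`diff_mask`, `group_relative_advantages`, `token_advantages`), the `boost` parameter, and the whitespace tokenization are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def diff_mask(original, repaired):
    """Return a 0/1 mask over `original` tokens.

    A position is 1 when the repair replaced or deleted that token.
    Pure insertions have no position in `original`, so they are not marked.
    (Assumption: the paper's masks may be built differently.)
    """
    mask = [0] * len(original)
    matcher = SequenceMatcher(a=original, b=repaired, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                mask[i] = 1
    return mask


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize terminal rewards within one sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(mask, advantage, boost=1.0):
    """Spread one sequence-level advantage over tokens.

    Hypothetical weighting: masked positions get their advantage scaled
    by (1 + boost); unmasked positions keep the sequence-level value.
    """
    return [advantage * (1.0 + boost * m) for m in mask]


if __name__ == "__main__":
    wrong = "the triangle has 4 sides so the answer is 8".split()
    fixed = "the triangle has 3 sides so the answer is 6".split()
    mask = diff_mask(wrong, fixed)

    # One group of four sampled answers with binary terminal rewards;
    # the erroneous trajectory above is taken to be the second sample.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print(mask)
    print(token_advantages(mask, advantages[1]))
```

In this toy example the mask flags only the two tokens that the repair changed ("4" and "8"), so the negative advantage of the failed sample is concentrated on the positions where the reasoning went wrong rather than spread uniformly across the whole trajectory. That is the intuition behind denser credit assignment; the actual masking granularity and loss formulation would need to be taken from the paper itself.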