🤖 AI Summary
This work addresses the “blind reasoning” problem in multimodal reinforcement learning, where models often rely excessively on linguistic priors and neglect visual inputs. To mitigate this, the authors propose a Differential Visual Reasoning Policy (DVRP) that leverages a visual triplet—comprising original, masked, and perturbed images—to construct an intrinsic reward mechanism based on discrepancies in visual information. This approach guides the agent to align its reasoning strictly with observable visual changes, without requiring external annotations or auxiliary tools. DVRP significantly enhances the model’s visual sensitivity and robustness, outperforming existing methods on both general and medical multimodal benchmarks. By explicitly promoting reliance on and comprehension of visual evidence, the method effectively alleviates blind reasoning and strengthens grounding in perceptual inputs.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced reasoning capabilities in Large Language Models. However, adapting RLVR to multimodal domains suffers from a critical \textit{perception-reasoning decoupling}. Existing paradigms, driven by text-centric outcome rewards and by reasoning carried out in the language medium, inadvertently encourage models to bypass visual perception. We empirically validate this through blind experiments: state-of-the-art policies maintain or, surprisingly, even improve performance when visual inputs are entirely removed. This reveals that these models degenerate into \textit{blind reasoners}, exploiting linguistic priors to generate plausible answers instead of attending to visual evidence. In response, we propose \textbf{Thinking with Deltas}, a framework driven by a \textbf{Differential Visual Reasoning Policy (DVRP)}. DVRP introduces intrinsic supervision via visual triplets, comprising original, masked, and perturbed inputs. It optimizes the model to maximize reasoning divergence from masked inputs (enforcing \textit{visual sensitivity}) while minimizing divergence from perturbed inputs (ensuring \textit{visual robustness}). By aligning reasoning variations strictly with the \textit{Delta} of visual information, DVRP inherently bolsters visual understanding capabilities and significantly outperforms state-of-the-art methods on both general and medical benchmarks, without requiring external annotations or auxiliary tools.
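The triplet objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the use of KL divergence over a discrete answer distribution, the function names `kl_divergence` and `differential_reward`, and the weights `alpha` and `beta` are all assumptions made for illustration. The sketch only shows the sign structure the abstract states: divergence from the masked input is rewarded, and divergence from the perturbed input is penalized.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def differential_reward(p_orig, p_masked, p_perturbed, alpha=1.0, beta=1.0):
    """Hypothetical intrinsic reward over a visual triplet.

    p_orig, p_masked, p_perturbed: the policy's answer distributions given the
    original, masked, and perturbed image respectively (assumed representation).
    alpha, beta: illustrative weights, not values from the paper.
    """
    # Visual sensitivity: the output should change when visual evidence is masked.
    sensitivity = kl_divergence(p_orig, p_masked)
    # Visual robustness: the output should stay stable under a benign perturbation.
    robustness_penalty = kl_divergence(p_orig, p_perturbed)
    return alpha * sensitivity - beta * robustness_penalty


# A visually grounded policy: masking the image flattens its answer distribution,
# while a mild perturbation barely moves it.
grounded = differential_reward(
    p_orig=[0.90, 0.05, 0.05],
    p_masked=[0.34, 0.33, 0.33],
    p_perturbed=[0.88, 0.06, 0.06],
)

# A "blind reasoner": its answer distribution ignores the image entirely,
# so all three distributions coincide.
blind = differential_reward(
    p_orig=[0.90, 0.05, 0.05],
    p_masked=[0.90, 0.05, 0.05],
    p_perturbed=[0.90, 0.05, 0.05],
)

print(f"grounded reward: {grounded:.4f}")
print(f"blind reward:    {blind:.4f}")
```

Under these assumptions, a policy whose output is identical with and without the image earns no sensitivity bonus, so an outcome reward alone can no longer be satisfied through linguistic priors. How the actual method measures divergence and combines it with the outcome reward is specified in the paper, not here.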