🤖 AI Summary
This work addresses a common failure of multimodal large language models: a disconnect between perception and reasoning that keeps them from faithfully using task-relevant visual evidence in complex reasoning. To bridge this gap, the authors propose Faithful-MR1, a two-stage training framework for faithful multimodal reasoning. In the Anchoring stage, perception is treated as an explicit pre-reasoning subtask, and the attention of a dedicated <Focus> token is supervised directly against image regions rather than through textual descriptions. In the Reinforcing stage, counterfactual image intervention is combined with verifiable rewards to favor answer-correct trajectories that concentrate visual attention on regions where vision causally matters. Faithful-MR1 outperforms recent multimodal reasoning baselines on Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a two-part faithfulness challenge: faithful perception of task-relevant visual evidence, and faithful use of that evidence during reasoning. Shortfalls in either part lead to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, which exposes a perception-reasoning disconnect in which correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
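The two stages can be illustrated with a toy sketch. It is not the authors' implementation: the abstract gives no loss or reward formulas, so the function names, the negative-log attention-mass loss, the masking-based test for causal patches, and the `threshold` and `weight` values are all assumptions chosen for illustration. Attention is represented as a plain list of per-patch weights.

```python
import math

def focus_attention_loss(attn, region_mask, eps=1e-8):
    """Anchoring-style loss (assumed form): negative log of the share of
    <Focus>-token attention that lands on the annotated image patches."""
    total = sum(attn)
    on_region = sum(a for a, m in zip(attn, region_mask) if m)
    return -math.log(on_region / (total + eps) + eps)

def causal_patches(base_score, masked_scores, threshold=0.1):
    """Counterfactual intervention (assumed form): patch i counts as causally
    relevant if masking it lowers the correct-answer score by more than threshold."""
    return [base_score - s > threshold for s in masked_scores]

def faithful_reward(answer_correct, attn, causal_mask, weight=0.5):
    """Reward (assumed form): zero for wrong answers; for correct ones,
    1 plus a bonus proportional to attention mass on causal patches."""
    if not answer_correct:
        return 0.0
    total = sum(attn)
    mass = sum(a for a, m in zip(attn, causal_mask) if m) / total if total > 0 else 0.0
    return 1.0 + weight * mass

# Hypothetical 4-patch image.
attn = [0.1, 0.6, 0.2, 0.1]
loss = focus_attention_loss(attn, [False, True, True, False])
mask = causal_patches(0.9, [0.88, 0.3, 0.85, 0.9])
reward = faithful_reward(True, attn, mask)
```

In this toy setup only the second patch changes the answer score when masked, so the reward bonus depends on how much attention that single patch receives. Gating the bonus on answer correctness mirrors the abstract's "answer-correct trajectories" wording.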