AI Summary
This work addresses the decay of visual information during long chain-of-thought multimodal reasoning, which impairs a model's ability to maintain long-range visual dependencies. It derives a criterion for when to intervene visually from two factors: local branching room, measured by token entropy, and downstream visual propagation potential, measured by how far the suffix diverges from a vision-marginalized reference. The proposed method, reflection-anchor policy optimization (RAPO), builds on GRPO: it selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate, so that the propagation of visual influence is modeled explicitly rather than left implicit. Experiments show substantial gains over strong baselines across multiple LVLM backbones on reasoning-intensive and general-domain benchmarks. Mechanism analyses indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO strengthens contrastive visual-dependence signals along generated trajectories.
Abstract
Long chain-of-thought (CoT) reasoning improves large vision-language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate for downstream visual dependence. Experiments on reasoning-intensive and general-domain benchmarks show that RAPO delivers substantial gains over strong baselines across multiple LVLM backbones. Mechanism analyses further indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO increases contrastive visual-dependence signals along generated trajectories.
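The two factors named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the top-k anchor rule, the window that starts one step after the anchor, and the plain mean over the window are all assumptions made for illustration, and the chain masking and the GRPO objective are omitted. It takes per-step next-token distributions from a vision-conditioned model and from a vision-marginalized reference, picks the highest-entropy steps as anchors, and measures the mean KL divergence over a finite window after an anchor.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def select_anchors(step_probs, top_k):
    """Positions of the top_k highest-entropy steps, in ascending order.

    Assumption: anchors are chosen by a simple top-k rule on token entropy.
    """
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)), key=lambda t: entropies[t], reverse=True)
    return sorted(ranked[:top_k])


def windowed_kl(vision_probs, reference_probs, anchor, window):
    """Mean KL(vision || reference) over the `window` steps after `anchor`.

    Assumption: the window starts at anchor + 1 and is truncated at the
    end of the sequence; an empty window yields 0.0.
    """
    start = anchor + 1
    end = min(start + window, len(vision_probs))
    if start >= end:
        return 0.0
    kls = [
        kl_divergence(vision_probs[t], reference_probs[t])
        for t in range(start, end)
    ]
    return sum(kls) / len(kls)


# Toy two-token vocabulary, four generation steps (hypothetical numbers).
vision = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.7, 0.3]]
reference = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

anchors = select_anchors(vision, top_k=1)
for a in anchors:
    print(a, windowed_kl(vision, reference, anchor=a, window=2))
```

In this toy case the uniform distribution at step 1 has the highest entropy, so it is the only anchor, and the window after it covers the two steps where the vision-conditioned model departs from the reference. A training objective would presumably use such a quantity as a signal of downstream visual dependence, but how RAPO weights or masks it is not specified in the abstract.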