IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses visual hallucination and logical inconsistency in multimodal reinforcement learning for long-horizon visual reasoning, which often arise from the information asymmetry between text and visual scenes. To mitigate this, the authors propose an iterative, visually grounded reasoning framework: a reward-driven trajectory filtering mechanism detects flawed rollouts, step-level error attribution locates the faulty step, and dynamic visual realignment corrects it. The resulting automatic correction loop, termed the Re-Reasoning Loop, iteratively refines reasoning trajectories into high-fidelity, expert-level reasoning templates. The authors report that the approach improves both visual grounding and logical coherence, consistently outperforms existing reinforcement learning methods across multiple multimodal benchmarks, and offers a new paradigm for consistency-preserving complex multimodal reasoning.

📝 Abstract

Multimodal large language models trained via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucination and logical errors. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning. As the reasoning chain unfolds, however, the inherent information asymmetry between text and visual scenes tends to erode visual grounding, resulting in misguided reasoning and erroneous outputs. To address this issue, we introduce IVR-R1 (Iterative Visual-grounded Reasoning), a novel RL training framework that performs dynamic visual re-alignment, actively rectifying reasoning trajectories to guide policy optimization. Specifically, IVR-R1 uses a reward-driven screening mechanism to identify flawed rollouts and then executes fine-grained, step-level error attribution within the multimodal context. By iteratively cross-referencing intermediate reasoning states against pristine visual priors, a Re-Reasoning Loop enables automated trajectory rectification, effectively synthesizing expert-level demonstrations that serve as high-fidelity reasoning templates for the policy model. Our experiments across diverse multimodal benchmarks demonstrate that IVR-R1 consistently outperforms existing reinforcement learning methods, establishing a superior paradigm for maintaining logical and visual consistency in complex multimodal reasoning.
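
The control flow described in the abstract (screen rollouts by reward, attribute the error to a step, regenerate from that step against the visual input) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so every name here (`Trajectory`, `re_reasoning_loop`, `reward_fn`, `locate_error`, `regenerate_from`, `SCENE`) is a hypothetical stand-in. A small dictionary of ground-truth counts plays the role of the "pristine visual prior", and no policy update is performed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    steps: List[str]
    reward: float


def re_reasoning_loop(
    steps: List[str],
    reward_fn: Callable[[List[str]], float],
    locate_error: Callable[[List[str]], Optional[int]],
    regenerate_from: Callable[[List[str]], List[str]],
    threshold: float = 1.0,   # assumed screening threshold
    max_iters: int = 3,       # assumed refinement budget
) -> Trajectory:
    """Refine a rollout until its reward passes the threshold or the budget runs out."""
    traj = Trajectory(steps, reward_fn(steps))
    for _ in range(max_iters):
        if traj.reward >= threshold:       # reward-driven screening: keep good rollouts
            break
        bad = locate_error(traj.steps)     # step-level error attribution
        if bad is None:                    # nothing attributable, stop refining
            break
        prefix = traj.steps[:bad]          # keep the steps before the first error
        new_steps = prefix + regenerate_from(prefix)  # redo the rest against the scene
        traj = Trajectory(new_steps, reward_fn(new_steps))
    return traj


# Toy stand-ins (hypothetical): the "scene" is a dict of object counts.
SCENE = {"apples": 3, "pears": 2}
ORDER = ["apples", "pears", "total"]


def truth(key: str) -> int:
    return sum(SCENE.values()) if key == "total" else SCENE[key]


def reward_fn(steps: List[str]) -> float:
    # Outcome reward: 1.0 only if the final answer matches the scene.
    return 1.0 if steps and steps[-1] == f"total={truth('total')}" else 0.0


def locate_error(steps: List[str]) -> Optional[int]:
    # Return the index of the first step that contradicts the scene.
    for i, step in enumerate(steps):
        key, value = step.split("=")
        if int(value) != truth(key):
            return i
    return None


def regenerate_from(prefix: List[str]) -> List[str]:
    # Re-derive the remaining steps by consulting the scene directly.
    return [f"{k}={truth(k)}" for k in ORDER[len(prefix):]]


flawed = ["apples=3", "pears=5", "total=8"]
fixed = re_reasoning_loop(flawed, reward_fn, locate_error, regenerate_from)
print(fixed.steps)  # ['apples=3', 'pears=2', 'total=5']
```

In this toy run the second step miscounts the pears, so the loop keeps the first step, regenerates the rest from the scene, and returns a trajectory with reward 1.0. In a real system the three callables would presumably be backed by a reward model, a multimodal critic, and the policy model itself; that mapping is an assumption, not something the abstract states.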

Problem

Research questions and friction points this paper is trying to address.

visual hallucination

logical error

information asymmetry

visual grounding

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative Visual-grounded Reasoning

Reinforcement Learning

Multimodal Reasoning