Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

📅 2025-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Most multimodal large language models (MLLMs) encode an image into a fixed set of visual tokens once, which makes visual reasoning inflexible and context-agnostic at inference time. Method: This paper proposes a verifier-guided iterative visual reasoning framework with a reasoner-verifier dual-module architecture: the reasoner proposes visual actions, and the verifier evaluates them and decides when to stop, which keeps reasoning grounded in the image and yields interpretable reasoning traces. Contribution/Results: First, it introduces dynamic visual token expansion, pruning, and termination at inference time. Second, it constructs VTS, a new visual reasoning dataset with supervised reasoning trajectories (VTS-SFT) and preference annotations (VTS-DPO). Third, it formalizes visual reasoning as a Markov Decision Process (MDP) and trains the verifier via multi-step Direct Preference Optimization (DPO). The framework outperforms existing approaches across multiple visual reasoning benchmarks, improving both accuracy and the interpretability of the reasoning process.

📝 Abstract

Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process involving a reasoner that proposes visual actions and a verifier, trained via multi-step Direct Preference Optimization (DPO), that evaluates these actions and determines when reasoning should terminate. To support this, we present a new dataset, VTS, comprising supervised reasoning trajectories (VTS-SFT) and preference-labeled reasoning comparisons (VTS-DPO). Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
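
The reasoner-verifier loop described in the abstract can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the State fields, the "stop" action, the score threshold, the step budget, and the toy reasoner, verifier, and tools. In particular, treating a low verifier score as the termination signal is only one plausible reading of "determines when reasoning should terminate".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class State:
    """MDP state: the question, the visual tokens gathered so far, and the action trace."""
    question: str
    visual_tokens: List[str]  # strings stand in for visual token embeddings
    trace: List[str] = field(default_factory=list)


# Hypothetical interfaces (not from the paper).
Reasoner = Callable[[State], str]          # proposes the next visual action
Verifier = Callable[[State, str], float]   # scores a proposed action in [0, 1]
Tool = Callable[[State], List[str]]        # executes an action, returns new visual tokens


def run_episode(state: State, reasoner: Reasoner, verifier: Verifier,
                tools: Dict[str, Tool], threshold: float = 0.5,
                max_steps: int = 8) -> State:
    """Iteratively scale visual tokens until the verifier rejects a step or the budget ends."""
    for _ in range(max_steps):
        action = reasoner(state)
        score = verifier(state, action)
        # Assumed stopping rule: explicit "stop" action or a low verifier score.
        if action == "stop" or score < threshold:
            break
        state.visual_tokens.extend(tools[action](state))  # add the new visual tokens
        state.trace.append(action)                        # keep an interpretable trace
    return state


# Toy components, purely illustrative.
def toy_reasoner(state: State) -> str:
    return "crop" if len(state.trace) < 2 else "ocr"


def toy_verifier(state: State, action: str) -> float:
    return 0.9 if action == "crop" else 0.1


toy_tools: Dict[str, Tool] = {
    "crop": lambda s: [f"crop_tok_{len(s.trace)}"],
    "ocr": lambda s: ["ocr_tok"],
}

final = run_episode(State("What is written on the sign?", ["img_tok"]),
                    toy_reasoner, toy_verifier, toy_tools)
print(final.trace, final.visual_tokens)
```

In this toy run the verifier accepts two crop actions and rejects the third proposal, so the episode ends with two extra visual tokens and a two-step trace. A real system would replace the callables with an MLLM reasoner, a trained verifier, and actual image operations.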

Problem

Research questions and friction points this paper is trying to address.

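Static inference encodes the entire image into fixed visual tokens upfront, leaving no way to refine understanding iteratively or adapt to context

Visual reasoning in MLLMs lacks a feedback mechanism that evaluates each step and decides when reasoning should terminate

Reasoning processes in existing approaches are less interpretable and less grounded in the visual content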

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic visual token scaling for iterative refinement

Verifier-guided reasoning with multi-step DPO

Markov Decision Process for visual action proposals
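
To make the preference-training idea concrete, here is a hedged sketch of the standard DPO loss for one (chosen, rejected) pair, plus a simple average over the steps of a trajectory. The per-pair formula is the commonly published DPO objective; averaging it over steps is an assumption made for illustration, since this summary does not state the paper's exact multi-step objective. The function names and the beta value are also hypothetical.

```python
import math
from typing import Sequence, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(logp_chosen: float, logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (policy log-ratio gap vs. reference))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return softplus(-margin)  # equals -log(sigmoid(margin))


def multi_step_dpo_loss(steps: Sequence[Tuple[float, float, float, float]],
                        beta: float = 0.1) -> float:
    """Assumed aggregation: mean of per-step pair losses along one trajectory."""
    return sum(dpo_pair_loss(*s, beta=beta) for s in steps) / len(steps)


# Each tuple: (logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected).
example_steps = [(-1.0, -3.0, -2.0, -2.0), (-1.5, -1.5, -1.5, -1.5)]
print(multi_step_dpo_loss(example_steps))
```

When the policy and the reference agree on both responses, the margin is zero and the per-pair loss equals log 2; the loss falls below that as the policy favors the chosen step more strongly than the reference does.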
