🤖 AI Summary
Chain-of-thought prompting yields limited gains for large vision-language models (LVLMs) on complex visual reasoning tasks. Method: We propose a training-free inference-enhancement framework that, for the first time, integrates Monte Carlo Tree Search (MCTS) with a multimodal self-reward mechanism that jointly evaluates sub-question utility, answer correctness, and cross-modal clue relevance, without requiring auxiliary discriminative models. We further incorporate reasoning-path modeling and test-time scaling to improve inference depth and robustness. Contribution/Results: Our method achieves state-of-the-art performance on three multimodal mathematical reasoning benchmarks. Crucially, it provides the first empirical validation of the test-time scaling law in multimodal reasoning, demonstrating predictable performance gains as the inference budget grows. The framework is fully plug-and-play, requires no fine-tuning or external supervision, and generalizes across diverse LVLM backbones.
📝 Abstract
Large Vision-Language Models (LVLMs) have shown exceptional performance on multimodal tasks, but their effectiveness in complex visual reasoning remains limited, particularly when Chain-of-Thought prompting techniques are employed. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST traverses the reasoning space by building a search tree in which each node encapsulates a reasoning step and each path delineates a complete reasoning sequence. Our multimodal Self-Reward mechanism assesses the quality of each reasoning step by combining the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without requiring additional models. VReST surpasses current prompting methods and achieves state-of-the-art performance across three multimodal mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy of test-time scaling laws in multimodal tasks, offering a promising direction for future research.
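The search procedure the abstract describes can be sketched in minimal form. Everything below is illustrative, not the paper's implementation: the `Node` structure, the UCT selection rule, and the fixed weights combining the three self-reward signals are assumptions about one plausible realization. In VReST the sub-question utility, answer correctness, and vision-language clue relevance would be scored by prompting the LVLM itself; here they are stubbed with hypothetical heuristics so the control flow is runnable.

```python
import math
import random

# --- Stubbed self-reward signals (hypothetical stand-ins; in VReST these
# --- would be elicited from the LVLM, not computed by fixed rules).
def subquestion_utility(step):
    return 0.8 if "sub-q" in step else 0.5

def answer_correctness(step):
    return 0.9 if "answer" in step else 0.4

def clue_relevance(step):
    return 0.7  # constant placeholder for vision-language clue relevance

def self_reward(step, w=(0.4, 0.4, 0.2)):
    """Weighted combination of the three multimodal self-reward signals
    (weights are an illustrative assumption)."""
    return (w[0] * subquestion_utility(step)
            + w[1] * answer_correctness(step)
            + w[2] * clue_relevance(step))

class Node:
    """A search-tree node holding one reasoning step; the path from the
    root to a node is a partial reasoning sequence."""
    def __init__(self, step, parent=None):
        self.step = step
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # cumulative self-reward

    def uct(self, c=1.4):
        if self.visits == 0:
            return float("inf")  # always try unvisited children first
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root, expand, iterations=50):
    """One MCTS loop: select a leaf by UCT, expand candidate next steps,
    score one of them with the self-reward, and backpropagate."""
    for _ in range(iterations):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=Node.uct)
        for step in expand(node.step):            # expansion
            node.children.append(Node(step, parent=node))
        leaf = random.choice(node.children) if node.children else node
        reward = self_reward(leaf.step)           # evaluation
        while leaf is not None:                   # backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits)

# Toy expansion: two candidate next steps per node, capped at depth 2.
def toy_expand(step):
    if step.count("|") >= 2:
        return []
    return [step + "|sub-q", step + "|answer"]

random.seed(0)
root = Node("root")
best = mcts(root, toy_expand, iterations=50)
```

The most-visited child of the root is returned as the preferred first reasoning step; a full system would iterate this to extract a complete path and would replace the stub scorers with LVLM-generated judgments.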