🤖 AI Summary
This work investigates whether reinforcement learning–fine-tuned vision-language models (RL-VLMs) possess reliable cross-modal (vision + text) self-verification capabilities during inference-time scaling. We propose a unified inference-time compute scaling framework and systematically evaluate decoding-time scaling, majority voting, best-of-N sampling, and self-verification prompting across multiple visual reasoning benchmarks. Our experiments reveal, for the first time, fundamental limitations in current RL-VLMs’ self-verification: “aha moments” do not yield significant performance gains, and generation-dominated strategies (e.g., majority voting) consistently outperform verification-dominated ones (e.g., best-of-N) by an average relative improvement of 12.3%. The core contribution is the empirical establishment of a “generation-over-verification” principle, demonstrating that self-verification is ineffective as an error-correction mechanism for contemporary RL-VLMs. These findings provide critical empirical evidence and methodological guidance for inference-time scaling paradigms in multimodal reasoning.
📝 Abstract
Recent advances in large language models (LLMs) have demonstrated that inference-time computation techniques, such as decoding-time scaling and self-refinement, can significantly enhance reasoning capabilities without relying on external knowledge. A key driver of this success is the emergence of self-correction and self-verification behaviors, often elicited through reinforcement learning (RL). In this paper, we investigate whether these inference-time techniques extend effectively to vision-language models (VLMs), particularly those trained with RL. We find that while decoding strategies such as majority voting and best-of-N selection with self-verification both improve VLM reasoning performance, generation-reliant methods such as the former achieve significantly higher gains than verification-reliant methods such as the latter. Additionally, the self-correction behavior often associated with RL-tuned models, such as the “aha moment,” does not lead to measurable gains. Through extensive experimentation within our inference-time scaling framework, we identify a key root cause: RL-trained VLMs still lack robust self-verification capabilities across both visual and textual modalities.
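To make the generation-versus-verification contrast concrete, below is a minimal Python sketch of the two decoding strategies compared in the paper. The helper names (`majority_vote`, `best_of_n`) and the `mock_verifier` are illustrative assumptions, not the paper's implementation; in practice, each candidate answer and each verification score would come from sampled VLM calls.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Generation-dominated strategy: sample N candidate answers and
    return the most frequent one (self-consistency voting)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: List[str], verify: Callable[[str], float]) -> str:
    """Verification-dominated strategy: score each candidate with a
    (self-)verifier and keep the highest-rated one."""
    return max(answers, key=verify)


if __name__ == "__main__":
    # Hypothetical candidates sampled from an RL-tuned VLM for one question.
    candidates = ["42", "42", "17", "42", "17"]

    # Majority voting needs no verifier and picks the modal answer.
    print(majority_vote(candidates))  # -> "42"

    # Stand-in self-verification scores; in practice this would be a
    # second VLM call prompting the model to judge each answer. If the
    # verifier is unreliable, as the paper argues for current RL-VLMs,
    # best-of-N can promote a minority answer it merely rates as confident.
    mock_verifier = lambda a: {"42": 0.4, "17": 0.9}[a]
    print(best_of_n(candidates, mock_verifier))  # -> "17"
```

The sketch illustrates why the two strategies diverge: majority voting depends only on the generator's answer distribution, while best-of-N inherits any miscalibration in the model's self-verification scores.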