V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the imagination-action-observer (IAO) bias in multimodal large language models: a misalignment between the actions a model imagines in advance and the feedback it observes after execution, which undermines reasoning stability and optimality. To mitigate it, the authors propose V-ABS, an action-observer driven beam search framework paired with an entropy-based adaptive weighting algorithm that dynamically balances the confidence scores of policy priors and observational feedback. They also construct a supervised fine-tuning dataset of more than 80,000 samples to guide models to assign higher prior confidence to correct action paths. The method achieves state-of-the-art performance across eight benchmarks, with an average improvement of 19.7% over the Qwen3-VL-8B baseline and consistent gains on both open-source and proprietary models.

📝 Abstract

Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.
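
The abstract does not give the weighting formula, so the following is only an illustrative sketch of how an entropy-based balance between policy priors and observer feedback could be wired into one beam-selection step. Everything in it is an assumption rather than the authors' method: the function names (`normalized_entropy`, `fuse_scores`, `select_beam`), the use of "1 minus normalized entropy" as a confidence measure, and the linear mixing of the two score distributions are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a distribution, scaled to the range [0, 1]."""
    n = len(probs)
    if n <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def normalize(scores: Sequence[float]) -> List[float]:
    """Rescale non-negative scores so they sum to 1 (uniform if all zero)."""
    total = sum(scores)
    if total <= 0.0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def fuse_scores(prior: Sequence[float], observer: Sequence[float]) -> List[float]:
    """Mix prior and observer scores, trusting the lower-entropy source more.

    Assumed scheme (not from the paper): each source's confidence is
    1 - normalized entropy, and the mixing weight is the prior's share
    of the total confidence.
    """
    p = normalize(prior)
    o = normalize(observer)
    conf_prior = max(0.0, 1.0 - normalized_entropy(p))
    conf_obs = max(0.0, 1.0 - normalized_entropy(o))
    total = conf_prior + conf_obs
    # If both sources are uninformative, fall back to an even split.
    w = 0.5 if total == 0.0 else conf_prior / total
    return [w * pi + (1.0 - w) * oi for pi, oi in zip(p, o)]


def select_beam(
    candidates: Sequence[str],
    prior: Sequence[float],
    observer: Sequence[float],
    beam_width: int,
) -> List[Tuple[str, float]]:
    """Keep the top `beam_width` candidate actions by fused score."""
    fused = fuse_scores(prior, observer)
    ranked = sorted(zip(candidates, fused), key=lambda pair: pair[1], reverse=True)
    return ranked[:beam_width]
```

Under these assumptions, a flat (high-entropy) prior contributes almost nothing to the ranking, so a sharply peaked observer signal decides which actions survive in the beam, and vice versa. A full thinker-actor-observer loop would call a step like `select_beam` once per iteration, after executing each candidate action and scoring its observed result.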

Problem

Research questions and friction points this paper is trying to address.

visual reasoning

multimodal large language models

execution feedback

IAO bias

dynamic reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

action-observer feedback

beam search

multimodal reasoning