V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

📅 2026-05-11

🤖 AI Summary
This work addresses the imagination-action-observer (IAO) bias in multimodal large language models: a misalignment between the model's prior imagination of an action and the feedback observed after executing it, which undermines reasoning stability and optimality. To mitigate this bias, the authors propose the V-ABS framework, which combines an action-observer driven beam search with an entropy-based adaptive weighting algorithm that dynamically balances confidence between policy priors and observational feedback. They also construct a supervised fine-tuning dataset of over 80,000 samples to guide models toward assigning higher prior confidence to correct action paths. The method achieves state-of-the-art performance across eight benchmarks, with an average improvement of 19.7% over the Qwen3-VL-8B baseline and consistent gains on both open-source and proprietary models.
📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.
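The abstract does not give the exact formulation of the entropy-based weighting, so the following is only a minimal sketch of one way such a scheme could work, not the authors' algorithm. It assumes that each candidate action carries a prior logit from the policy and an observer score in [0, 1], and that the observer is trusted more when the prior distribution is uncertain (high entropy). All function names, the linear blending rule, and the action labels are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw prior logits into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to the range [0, 1]."""
    n = len(probs)
    if n <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def fuse_scores(prior_probs, observer_scores):
    """Blend prior and observer scores (assumed rule, not from the paper).

    The observer weight equals the normalized entropy of the prior:
    an uncertain prior defers to the observer, a confident prior dominates.
    """
    w_obs = normalized_entropy(prior_probs)
    return [
        (1.0 - w_obs) * p + w_obs * o
        for p, o in zip(prior_probs, observer_scores)
    ]


def beam_step(candidates, beam_width):
    """Keep the top `beam_width` candidates by fused score.

    `candidates` is a list of (action, prior_logit, observer_score) tuples.
    """
    priors = softmax([c[1] for c in candidates])
    fused = fuse_scores(priors, [c[2] for c in candidates])
    ranked = sorted(zip(candidates, fused), key=lambda x: x[1], reverse=True)
    return [(cand[0], score) for cand, score in ranked[:beam_width]]


if __name__ == "__main__":
    # Hypothetical actions with equal prior logits: the prior is maximally
    # uncertain, so the ranking follows the observer scores.
    step = beam_step(
        [("crop", 0.0, 0.2), ("zoom", 0.0, 0.9), ("rotate", 0.0, 0.5)],
        beam_width=2,
    )
    print(step)
```

In this sketch a uniform prior hands the decision entirely to the observer, while a fully confident prior ignores it; a real system would likely calibrate both signals before mixing them.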
Problem

Research questions and friction points this paper is trying to address.

visual reasoning
multimodal large language models
execution feedback
IAO bias
dynamic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

action-observer feedback
beam search
multimodal reasoning
adaptive weighting
IAO bias
Zhiwei Ning
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; SenseTime Research
Xuanang Gao
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
Jiaxi Cao
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
Gengming Zhang
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
Shengnan Ma
SenseTime Research
Wenwen Tong
Peking University
Fluid Mechanics, Deep Learning
Hanming Deng
Unknown affiliation
Deep Learning, Object Detection
Jie Yang
Shanghai Jiao Tong University
Image Processing, Medical Image Processing, Pattern Recognition
Wei Liu
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; Institute of Medical Robotics, Shanghai Jiao Tong University