VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search

📅 2025-09-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language-action (VLA) models suffer from myopic action prediction, leading to error accumulation in long-horizon robotic manipulation. To address this, we propose VLA-Reasonerβ€”a novel framework that integrates a world model, Monte Carlo Tree Search (MCTS), and kernel density estimation (KDE)-based confidence-aware sampling to enable online multi-step future state rollouts and efficient decision-making. Leveraging offline reward shaping and a rolling prediction mechanism, VLA-Reasoner dynamically scales computation at test time without requiring retraining, making it directly compatible with any off-the-shelf VLA model. Evaluated on both simulation and real-world robotic platforms, VLA-Reasoner significantly improves success rates and generalization across long-horizon tasks. Our results empirically validate the effectiveness and practicality of scalable, test-time reasoning for complex robotic manipulation.
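The "offline reward shaping" the summary mentions can be illustrated with classic potential-based shaping, where intermediate rollout states receive dense feedback for progress toward the goal. This is a generic sketch, not the paper's actual reward function; the toy distance-based potential and all names here are assumptions.

```python
def potential(state, goal):
    """Toy potential: negative distance to the goal (illustrative only)."""
    return -abs(goal - state)

def shaped_reward(s, s_next, goal, gamma=0.99):
    """Potential-based shaping term F = gamma*Phi(s') - Phi(s):
    positive when the transition moves closer to the goal,
    negative when the rollout drifts away from it."""
    return gamma * potential(s_next, goal) - potential(s, goal)
```

Because the shaping term telescopes along a trajectory, it rewards progress at every step without changing which policy is optimal, which is what makes it useful for scoring intermediate states in a search tree.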

πŸ“ Abstract
Vision-Language-Action models (VLAs) achieve strong performance on general robotic manipulation tasks by scaling imitation learning. However, existing VLAs are limited to short-sighted next-action prediction and struggle with long-horizon tasks because small deviations accumulate over the trajectory. To address this problem, we propose a plug-in framework named VLA-Reasoner that empowers off-the-shelf VLAs with the capability to foresee future states via test-time scaling. Specifically, VLA-Reasoner samples and rolls out possible action trajectories, where the sampled actions serve as rationales for generating future states via a world model; this enables VLA-Reasoner to foresee and reason about potential outcomes and to search for optimal actions. We further leverage Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces, where stepwise VLA predictions seed the root. Meanwhile, we introduce a confidence sampling mechanism based on Kernel Density Estimation (KDE) that enables efficient exploration in MCTS without redundant VLA queries. We evaluate intermediate states in MCTS via an offline reward shaping strategy that scores predicted futures and corrects deviations with long-term feedback. We conducted extensive experiments in both simulators and the real world, demonstrating that VLA-Reasoner achieves significant improvements over state-of-the-art VLAs. Our method highlights a potential pathway toward scalable test-time computation for robotic manipulation.
Problem

Research questions and friction points this paper is trying to address.

Addresses short-sighted next-action prediction in VLAs
Enables long-horizon reasoning via future state simulation
Improves search efficiency in large robotic action spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Monte Carlo Tree Search for action planning
Kernel Density Estimation for efficient exploration
Offline reward shaping to evaluate future states
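The KDE-based confidence idea in the innovation list can be sketched as a query filter: actions already drawn from the VLA form a density estimate, and a new candidate that lands in a high-density (high-confidence) region is reused instead of paying for another VLA query. A minimal 1-D sketch, assuming a Gaussian kernel and a fixed density threshold; none of these names or values come from the paper.

```python
import math

def kde_density(x, samples, bandwidth=0.5):
    """Gaussian kernel density estimate at point x (1-D for clarity)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

def confident_sample(candidate, cached_actions, query_vla, threshold=0.1):
    """Reuse the candidate when the cache deems it high-density;
    otherwise issue a fresh (expensive) VLA query and grow the cache."""
    if cached_actions and kde_density(candidate, cached_actions) >= threshold:
        return candidate              # confident region: skip the VLA call
    fresh = query_vla()               # low-confidence region: query the model
    cached_actions.append(fresh)
    return fresh
```

The design intuition is that MCTS expansion repeatedly asks for actions near states it has already explored, so a cheap density check over past samples can absorb most of those repeated queries.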
Wenkai Guo
School of Electrical and Electronic Engineering, Nanyang Technological University
Guanxing Lu
Tsinghua University
VLA, RL, Robotics, 3D Vision
Haoyuan Deng
Nanyang Technological University
Robotics, Imitation Learning, Reinforcement Learning
Zhenyu Wu
School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications
Yansong Tang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Ziwei Wang
School of Electrical and Electronic Engineering, Nanyang Technological University