VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language-action (VLA) models suffer from myopic action prediction, which leads to error accumulation in long-horizon robotic manipulation. To address this, we propose VLA-Reasoner, a framework that integrates a world model, Monte Carlo Tree Search (MCTS), and kernel density estimation (KDE)-based confidence-aware sampling to enable online multi-step rollouts of future states and efficient decision-making. Using offline reward shaping and a rolling prediction mechanism, VLA-Reasoner scales computation dynamically at test time without retraining, which makes it directly compatible with any off-the-shelf VLA model. Evaluated on both simulated and real-world robotic platforms, VLA-Reasoner significantly improves success rates and generalization across long-horizon tasks. Our results empirically validate the effectiveness and practicality of scalable test-time reasoning for complex robotic manipulation.
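The rolling prediction idea in the summary can be illustrated with a small receding-horizon loop: at every step, each candidate action is rolled forward through a world model, the best-scoring one is executed, and planning restarts from the new state. This is only a sketch under stated assumptions, not the paper's implementation. The 1-D state, the names `rollout_return` and `rolling_control`, the candidate set, and the distance-based reward are all hypothetical stand-ins.

```python
def rollout_return(state, first_action, policy, world_model, reward, horizon):
    """Score one candidate: apply it, then follow the base policy for the rest."""
    s = world_model(state, first_action)
    total = reward(s)
    for _ in range(horizon - 1):
        s = world_model(s, policy(s))
        total += reward(s)
    return total


def rolling_control(state, candidates, policy, world_model, reward, horizon, steps):
    """Replan at every step and execute only the first action of the best rollout."""
    trajectory = [state]
    for _ in range(steps):
        best = max(
            candidates(state),
            key=lambda a: rollout_return(state, a, policy, world_model, reward, horizon),
        )
        state = world_model(state, best)
        trajectory.append(state)
    return trajectory


# Toy setup (all hypothetical): a point on a line that should stop at GOAL.
GOAL = 3.0
myopic_policy = lambda s: 1.0               # stand-in for a VLA that always steps forward
toy_world_model = lambda s, a: s + a        # stand-in for a learned world model
toy_reward = lambda s: -abs(s - GOAL)       # stand-in for an offline-shaped reward
toy_candidates = lambda s: [1.0, 0.0, -1.0]

trajectory = rolling_control(0.0, toy_candidates, myopic_policy,
                             toy_world_model, toy_reward, horizon=2, steps=5)
```

In this toy run the myopic policy alone would keep stepping past the goal, while the rolling loop stops at 3.0 because overshooting rollouts score worse.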

📝 Abstract

Vision-Language-Action models (VLAs) achieve strong performance in general robotic manipulation tasks by scaling imitation learning. However, existing VLAs are limited to short-sighted next-action prediction and therefore struggle with long-horizon trajectory tasks, where incremental deviations accumulate. To address this problem, we propose a plug-in framework named VLA-Reasoner that equips off-the-shelf VLAs with the ability to foresee future states via test-time scaling. Specifically, VLA-Reasoner samples candidate action trajectories and rolls them out through a world model, treating the sampled actions as rationales from which future states are generated; this lets VLA-Reasoner foresee and reason about potential outcomes and search for the optimal actions. We further leverage Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces, with stepwise VLA predictions seeding the root. In addition, we introduce a confidence sampling mechanism based on Kernel Density Estimation (KDE) that enables efficient exploration in MCTS without redundant VLA queries. We evaluate intermediate states in MCTS via an offline reward shaping strategy, which scores predicted futures and corrects deviations with long-term feedback. We conduct extensive experiments in both simulators and the real world, demonstrating that VLA-Reasoner achieves significant improvements over state-of-the-art VLAs. Our method highlights a potential pathway toward scalable test-time computation for robotic manipulation.
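The search loop described in the abstract can be sketched as a generic MCTS over imagined states. The abstract does not give the paper's actual selection rule, expansion policy, or reward, so everything here is an assumption for illustration: plain UCB1 selection, a fixed branching `width`, leaf evaluation by a reward function, and the toy `toy_vla_propose`, `toy_world_model`, and `toy_reward` functions, which stand in for the VLA, the world model, and the shaped reward.

```python
import math
import random


class Node:
    def __init__(self, state, action=None, parent=None):
        self.state = state          # imagined state predicted by the world model
        self.action = action        # action that led from the parent to this state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def ucb(child, parent_visits, c=1.4):
    """UCB1 score: mean value plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")
    return child.mean_value() + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts_plan(state, propose, world_model, reward, n_iter, depth, width, rng):
    root = Node(state)
    for _ in range(n_iter):
        node, d = root, 0
        while d < depth:
            if len(node.children) < width:
                # Expansion: sample an action and imagine the next state.
                a = propose(node.state, rng)
                node = Node(world_model(node.state, a), a, node)
                node.parent.children.append(node)
                break
            # Selection: descend to the child with the highest UCB score.
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
            d += 1
        r = reward(node.state)      # evaluate the imagined state
        while node is not None:     # backpropagate the score to the root
            node.visits += 1
            node.value_sum += r
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.action, root


# Toy stand-ins (hypothetical): 1-D position, goal at 1.0.
toy_vla_propose = lambda s, rng: 0.3 + rng.gauss(0.0, 0.3)
toy_world_model = lambda s, a: s + a
toy_reward = lambda s: -abs(s - 1.0)

action, root = mcts_plan(0.0, toy_vla_propose, toy_world_model, toy_reward,
                         n_iter=200, depth=3, width=3, rng=random.Random(0))
```

Returning the most-visited root child is a common MCTS convention; whether the paper uses visit counts or values for the final choice is not stated in the abstract.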

Problem

Research questions and friction points this paper is trying to address.

Addresses short-sighted next-action prediction in VLAs

Enables long-horizon reasoning via future state simulation

Improves search efficiency in large robotic action spaces

Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Monte Carlo Tree Search for action planning

Kernel Density Estimation for efficient exploration

Offline reward shaping to evaluate future states
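As a rough illustration of KDE-based confidence sampling, the sketch below fits a Gaussian kernel density estimate to a small buffer of actions, draws new candidates from it without querying the policy again, and keeps the highest-density ones. The buffer contents, the fixed `bandwidth`, the top-`k` filtering rule, and the function names are all assumptions made for this example; the summary does not specify how the paper fits the KDE or defines confidence.

```python
import math
import random


def kde_sample(actions, bandwidth, rng):
    """Draw from a Gaussian KDE: pick a stored action, then add Gaussian jitter."""
    center = rng.choice(actions)
    return [c + rng.gauss(0.0, bandwidth) for c in center]


def kde_density(x, actions, bandwidth):
    """Evaluate the isotropic Gaussian KDE at point x."""
    dim = len(x)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = 0.0
    for a in actions:
        sq_dist = sum((xi - ai) ** 2 for xi, ai in zip(x, a))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(actions) * norm)


def confident_candidates(actions, bandwidth, n_samples, k, rng):
    """Sample n_samples actions from the KDE and keep the k most probable ones."""
    samples = [kde_sample(actions, bandwidth, rng) for _ in range(n_samples)]
    samples.sort(key=lambda x: kde_density(x, actions, bandwidth), reverse=True)
    return samples[:k]


# Hypothetical buffer of 2-D actions previously produced by a policy.
action_buffer = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
candidates = confident_candidates(action_buffer, bandwidth=0.05,
                                  n_samples=50, k=5, rng=random.Random(0))
```

Candidates produced this way could serve as the expansion actions in a tree search, so that only the root needs a real policy query; that pairing is how the summary describes the role of KDE, but the details here are illustrative.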
