Value Vision-Language-Action Planning & Search

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the poor robustness of existing vision-language-action (VLA) models under distribution shift, which stems from their reliance on behavior cloning: without an effective estimate of long-horizon return, test-time planning is inefficient. To overcome these limitations, the authors propose the first integration of a learnable value function into a VLA planning framework. Specifically, a lightweight MLP is trained on frozen latent representations from an Octo backbone to provide an explicit success signal for Monte Carlo Tree Search (MCTS), guiding the search toward high-value regions rather than relying on the policy prior alone. On the LIBERO benchmark, this approach improves success rate by over 5 percentage points while reducing MCTS simulation counts by 5–15%, enhancing both planning efficiency and robustness.

📝 Abstract
Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic manipulation, yet they remain fundamentally limited by their reliance on behavior cloning, leading to brittleness under distribution shift. While augmenting pretrained models with test-time search algorithms like Monte Carlo Tree Search (MCTS) can mitigate these failures, existing formulations rely solely on the VLA prior for guidance, lacking a grounded estimate of expected future return. Consequently, when the prior is inaccurate, the planner can only correct action selection via the exploration term, which requires extensive simulation to become effective. To address this limitation, we introduce Value Vision-Language-Action Planning and Search (V-VLAPS), a framework that augments MCTS with a lightweight, learnable value function. By training a simple multilayer perceptron (MLP) on the latent representations of a fixed VLA backbone (Octo), we provide the search with an explicit success signal that biases action selection toward high-value regions. We evaluate V-VLAPS on the LIBERO robotic manipulation suite, demonstrating that our value-guided search improves success rates by over 5 percentage points while reducing the average number of MCTS simulations by 5-15 percent compared to baselines that rely only on the VLA prior.
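The abstract's core mechanism, MCTS action selection biased by a learned value head on frozen backbone latents, can be illustrated with a minimal sketch. This is not the authors' implementation: the PUCT-style scoring rule, the `lam` weighting of the value bonus, and the tiny randomly initialized MLP standing in for a head trained on success labels over Octo latents are all illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 32  # stand-in for the frozen VLA backbone's latent size

def value_mlp(latent, W1, b1, W2, b2):
    """Lightweight MLP value head: latent state -> success score in (0, 1)."""
    h = np.maximum(W1 @ latent + b1, 0.0)            # ReLU hidden layer
    return 1.0 / (1.0 + math.exp(-(W2 @ h + b2)))    # sigmoid output

# Randomly initialized weights; in the paper the head is trained on
# success signals while the backbone stays fixed.
W1, b1 = rng.normal(size=(16, LATENT_DIM)), np.zeros(16)
W2, b2 = rng.normal(size=16), 0.0

def select_child(children, c_puct=1.0, lam=0.5):
    """PUCT-style selection augmented with a value bonus (illustrative).

    children: dicts with 'Q' (mean return), 'N' (visit count),
    'prior' (VLA policy probability), 'latent' (successor-state latent).
    The lam * V(s') term biases search toward high-value regions instead
    of relying on the exploration term alone when the prior is inaccurate.
    """
    total_n = sum(ch["N"] for ch in children)

    def score(ch):
        explore = c_puct * ch["prior"] * math.sqrt(total_n) / (1 + ch["N"])
        value_bonus = lam * value_mlp(ch["latent"], W1, b1, W2, b2)
        return ch["Q"] + explore + value_bonus

    return max(children, key=score)

children = [
    {"Q": 0.2, "N": 3, "prior": 0.6, "latent": rng.normal(size=LATENT_DIM)},
    {"Q": 0.1, "N": 1, "prior": 0.3, "latent": rng.normal(size=LATENT_DIM)},
    {"Q": 0.0, "N": 0, "prior": 0.1, "latent": rng.normal(size=LATENT_DIM)},
]
best = select_child(children)
```

Without the `value_bonus` term this reduces to prior-only PUCT, where a bad prior can only be corrected through many simulations of the exploration term; the intent here is to show where the learned signal enters the selection rule.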
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action, distribution shift, Monte Carlo Tree Search, value function, robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action, Monte Carlo Tree Search, value function, robotic manipulation, planning and search
Ali Salamatian, The University of British Columbia
Ke Ren, The University of British Columbia
Kieran Pattison, The University of British Columbia
Cyrus Neary, The University of British Columbia
artificial intelligence, reinforcement learning, machine learning, multiagent systems, control