Value Vision-Language-Action Planning & Search

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the poor robustness of existing vision-language-action (VLA) models under distribution shift, which stems from their reliance on behavior cloning: without an effective estimate of long-horizon return, test-time planning is inefficient. To overcome these limitations, the authors propose the first integration of a learnable value function into a VLA planning framework. Specifically, a lightweight MLP is trained on frozen latent representations from an Octo backbone to provide an explicit success signal for Monte Carlo Tree Search (MCTS), guiding the search toward high-value regions rather than relying on the policy prior alone. On the LIBERO benchmark, this approach improves success rate by over 5 percentage points while reducing MCTS simulation counts by 5–15%, enhancing both planning efficiency and robustness.

📝 Abstract
Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic manipulation, yet they remain fundamentally limited by their reliance on behavior cloning, leading to brittleness under distribution shift. While augmenting pretrained models with test-time search algorithms like Monte Carlo Tree Search (MCTS) can mitigate these failures, existing formulations rely solely on the VLA prior for guidance, lacking a grounded estimate of expected future return. Consequently, when the prior is inaccurate, the planner can only correct action selection via the exploration term, which requires extensive simulation to become effective. To address this limitation, we introduce Value Vision-Language-Action Planning and Search (V-VLAPS), a framework that augments MCTS with a lightweight, learnable value function. By training a simple multilayer perceptron (MLP) on the latent representations of a fixed VLA backbone (Octo), we provide the search with an explicit success signal that biases action selection toward high-value regions. We evaluate V-VLAPS on the LIBERO robotic manipulation suite, demonstrating that our value-guided search improves success rates by over 5 percentage points while reducing the average number of MCTS simulations by 5-15 percent compared to baselines that rely only on the VLA prior.
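The abstract's core mechanism, MCTS action selection biased by a learned value head on frozen backbone latents, can be illustrated with a minimal sketch. This is not the authors' implementation: the PUCT-style scoring rule, the `lam` weighting of the value bonus, and the tiny randomly initialized MLP standing in for a head trained on success labels over Octo latents are all illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 32  # stand-in for the frozen VLA backbone's latent size

def value_mlp(latent, W1, b1, W2, b2):
    """Lightweight MLP value head: latent state -> success score in (0, 1)."""
    h = np.maximum(W1 @ latent + b1, 0.0)            # ReLU hidden layer
    return 1.0 / (1.0 + math.exp(-(W2 @ h + b2)))    # sigmoid output

# Randomly initialized weights; in the paper the head is trained on
# success signals while the backbone stays fixed.
W1, b1 = rng.normal(size=(16, LATENT_DIM)), np.zeros(16)
W2, b2 = rng.normal(size=16), 0.0

def select_child(children, c_puct=1.0, lam=0.5):
    """PUCT-style selection augmented with a value bonus (illustrative).

    children: dicts with 'Q' (mean return), 'N' (visit count),
    'prior' (VLA policy probability), 'latent' (successor-state latent).
    The lam * V(s') term biases search toward high-value regions instead
    of relying on the exploration term alone when the prior is inaccurate.
    """
    total_n = sum(ch["N"] for ch in children)

    def score(ch):
        explore = c_puct * ch["prior"] * math.sqrt(total_n) / (1 + ch["N"])
        value_bonus = lam * value_mlp(ch["latent"], W1, b1, W2, b2)
        return ch["Q"] + explore + value_bonus

    return max(children, key=score)

children = [
    {"Q": 0.2, "N": 3, "prior": 0.6, "latent": rng.normal(size=LATENT_DIM)},
    {"Q": 0.1, "N": 1, "prior": 0.3, "latent": rng.normal(size=LATENT_DIM)},
    {"Q": 0.0, "N": 0, "prior": 0.1, "latent": rng.normal(size=LATENT_DIM)},
]
best = select_child(children)
```

Without the `value_bonus` term this reduces to prior-only PUCT, where a bad prior can only be corrected through many simulations of the exploration term; the intent here is to show where the learned signal enters the selection rule.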
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action, distribution shift, Monte Carlo Tree Search, value function, robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action, Monte Carlo Tree Search, value function, robotic manipulation, planning and search
Ali Salamatian, The University of British Columbia
Ke Ren, The University of British Columbia
Kieran Pattison, The University of British Columbia
Cyrus Neary, The University of British Columbia
artificial intelligence, reinforcement learning, machine learning, multiagent systems, control