Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search

📅 2025-08-16
📈 Citations: 0
✹ Influential: 0
📄 PDF
🤖 AI Summary
To address the brittle behaviours and unsafe failures of pre-trained vision-language-action (VLA) models deployed zero-shot in out-of-distribution robotic tasks, this paper proposes VLAPS: a framework that embeds the action priors of a VLA policy into a modified Monte Carlo Tree Search (MCTS), enabling efficient, environment-aware planning for language-conditioned tasks. VLAPS provides a principled way to control test-time compute, to exploit an explicit model of the target environment, and to integrate established planning and reinforcement-learning techniques into the VLA inference process. Evaluated on language-specified robotic tasks, VLAPS significantly outperforms VLA-only baselines, improving task success rates by up to 67 percentage points, and solves long-horizon tasks that are intractable for uninformed search methods.

📝 Abstract
Pre-trained vision-language-action (VLA) models offer a promising foundation for generalist robot policies, but often produce brittle behaviours or unsafe failures when deployed zero-shot in out-of-distribution scenarios. We present Vision-Language-Action Planning & Search (VLAPS) -- a novel framework and accompanying algorithms that embed model-based search into the inference procedure of pre-trained VLA policies to improve their performance on robotic tasks. Specifically, our method biases a modified Monte Carlo Tree Search (MCTS) algorithm -- run using a model of the target environment -- using action priors defined by the VLA policy. By using VLA-derived abstractions and priors in model-based search, VLAPS efficiently explores language-conditioned robotics tasks whose search spaces would otherwise be intractably large. Conversely, by integrating model-based search with the VLA policy's inference procedure, VLAPS yields behaviours that are more performant than those obtained by directly following the VLA policy's action predictions. VLAPS offers a principled framework to: i) control test-time compute in VLA models, ii) leverage a priori knowledge of the robotic environment, and iii) integrate established planning and reinforcement learning techniques into the VLA inference process. Across all experiments, VLAPS significantly outperforms VLA-only baselines on language-specified tasks that would otherwise be intractable for uninformed search algorithms, increasing success rates by as much as 67 percentage points.
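The abstract describes biasing MCTS node selection with action priors taken from the VLA policy. The paper's exact algorithm is not reproduced here, but the standard way to fold a policy prior into MCTS is a PUCT-style selection rule; the sketch below illustrates that general idea, with all names and the constant `c_puct` being illustrative assumptions rather than details from the paper:

```python
import math

class Node:
    """Search-tree statistics for one candidate action."""
    def __init__(self, prior):
        self.prior = prior        # probability the (VLA) policy assigns to this action
        self.visits = 0           # visit count N(s, a)
        self.value_sum = 0.0      # cumulative backed-up return

    def value(self):
        # Mean value estimate; zero before the first visit.
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(parent_visits, child, c_puct=1.5):
    """PUCT: exploit the value estimate, explore in proportion to the prior."""
    exploration = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.value() + exploration

def select_action(children, c_puct=1.5):
    """Pick the child action maximising the PUCT score."""
    parent_visits = sum(c.visits for c in children.values()) + 1
    return max(children, key=lambda a: puct_score(parent_visits, children[a], c_puct))
```

Before any simulations, the exploration term dominates and the search follows the policy's prior; as visit counts and backed-up values accumulate from the environment model, the search can override a prior that leads to poor outcomes, which is the mechanism by which model-based search improves on directly following the policy's predictions.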
Problem

Research questions and friction points this paper is trying to address.

Enhance pre-trained VLA policies for robust robot behaviors
Integrate model-based search to improve VLA policy performance
Address large search spaces in language-conditioned robotics tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-based search enhances VLA policy inference
Modified MCTS with VLA action priors
VLAPS integrates planning and reinforcement learning