Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
VLA models suffer from a distribution mismatch between pretraining and downstream fine-tuning data, leading to unstable inference after supervised fine-tuning: redundant action modes induce policy drift. To address this, we propose TACO, a test-time scaling framework that introduces an anti-exploration mechanism into VLA inference without gradient updates. TACO employs a lightweight pseudo-count estimator to score multiple candidate action chunks generated by flow-matching or diffusion models, selecting only high-confidence, low-redundancy actions for execution. By explicitly suppressing suboptimal kinematic modes at inference time, TACO mitigates distribution shift without increasing training overhead. Experiments demonstrate that TACO significantly improves task success rates and inference stability across four simulation benchmarks and a real-world dual-arm robotic platform, outperforming existing methods in generalization and robustness.

📝 Abstract
Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, because VLAs absorb diverse data modes during pre-training, and the finetuning dataset often contains demonstrations collected in kinematically suboptimal or undesirable ways, there exist redundant action modes that are irrelevant to the successful action modes of the downstream task. Specifically, we observe a critical inference-time fragility across sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by the stable success modes of the downstream task dataset. We therefore propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. A VLA model integrated with TACO executes the action chunk with the maximum pseudo-count among all sampled candidates, thereby preventing distribution shift while preserving the generalization ability of the VLA, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational savings over RL updates, especially for flow- or diffusion-based VLAs, where the denoising process makes RL updates difficult. Extensive experiments across four simulation benchmarks (RoboTwin2.0, RoboTwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptation.
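The selection mechanism described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the pseudo-count here is approximated by a Gaussian kernel density over dataset action chunks (the paper's estimator may differ), and `policy_sample`, `bandwidth`, and the chunk representation are hypothetical names introduced for the example.

```python
import numpy as np

def pseudo_count(chunk, dataset_chunks, bandwidth=0.5):
    """Kernel-density proxy for a pseudo-count: higher means the candidate
    action chunk lies closer to modes present in the downstream dataset.
    (Illustrative stand-in for the paper's lightweight estimator.)"""
    dists = np.linalg.norm(dataset_chunks - chunk, axis=1)
    return float(np.exp(-(dists / bandwidth) ** 2).sum())

def select_action_chunk(policy_sample, num_samples, dataset_chunks):
    """Gradient-free test-time selection: draw several action chunks from
    the frozen VLA policy (different sampled noises) and execute the one
    with the maximum pseudo-count."""
    candidates = [np.asarray(policy_sample()) for _ in range(num_samples)]
    scores = [pseudo_count(c, dataset_chunks) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage: dataset modes cluster near the origin, so the candidate
# drawn near the origin is preferred over the outlier.
dataset = np.zeros((10, 4))                       # stand-in for dataset action chunks
cands = [np.ones(4) * 5.0, np.full(4, 0.1)]       # one far, one near the data modes
sampler = iter(cands)
chosen = select_action_chunk(lambda: next(sampler), 2, dataset)
```

Because only the verifier scores are compared, no gradients flow through the denoising process, which is what makes this cheaper than an RL update on a flow- or diffusion-based policy.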
Problem

Research questions and friction points this paper is trying to address.

Addresses inference instability in Vision-Language-Action models
Mitigates distribution shift between policy and downstream task modes
Enhances test-time action selection via pseudo-count verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time scaling with pseudo-count estimator
Gradient-free anti-exploration for VLA models
Lightweight verifier prevents distribution shift