A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

📅 2026-04-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high inference latency and computational overhead of existing vision-language-action (VLA) models, which stem from their reliance on large-scale vision-language foundations and iterative action generation, hindering deployment on standard hardware. The authors propose an efficient, open-source VLA framework that jointly optimizes the inference pipeline to drastically reduce computational cost while maintaining high task success rates. Key innovations include a budget-aware adaptive inference mechanism featuring an early-exit strategy based on inter-layer action consistency, cross-layer truncated flow matching, and warm-start denoising. Experiments demonstrate state-of-the-art performance on LIBERO, VLABench, and real-world robots (Franka and AgiBot), achieving up to 72% lower inference latency and a 76.6% reduction in backbone computation. Notably, the method attains a 29.00% success rate on RoboChallenge, surpassing advanced models such as pi0 and X-VLA.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost: billion-scale VLM backbones and iterative diffusion/flow-based action heads incur high latency and compute, making real-time control expensive on commodity hardware. We present A1, a fully open-source and transparent VLA framework designed for low-cost, high-throughput inference without sacrificing manipulation success. Our approach leverages pretrained VLMs that provide implicit affordance priors for action generation. We release the full training stack (training code, data/data-processing pipeline, intermediate checkpoints, and evaluation scripts) to enable end-to-end reproducibility. Beyond optimizing the VLM alone, A1 targets the full inference pipeline by introducing a budget-aware adaptive inference scheme that jointly accelerates the backbone and the action head. Specifically, we monitor action consistency across intermediate VLM layers to trigger early termination, and propose Inter-Layer Truncated Flow Matching, which warm-starts denoising across layers, enabling accurate actions with substantially fewer effective denoising iterations. Across simulation benchmarks (LIBERO, VLABench) and real robots (Franka, AgiBot), A1 achieves state-of-the-art success rates while significantly reducing inference cost (e.g., up to 72% lower per-episode latency for flow-matching inference and up to 76.6% backbone computation reduction with minor performance degradation). On RoboChallenge, A1 achieves an average success rate of 29.00%, outperforming baselines including pi0 (28.33%), X-VLA (21.33%), and RDT-1B (15.00%).
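The abstract's core loop can be illustrated with a toy sketch: decode an action after each backbone layer, warm-start the flow-matching integration from the previous layer's action with only a few steps, and exit early once consecutive layers agree. Everything below is a hypothetical stand-in (the layer model, the `tanh` feature-to-action map, all constants and thresholds), not the A1 implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS = 12   # assumed backbone depth (illustrative)
ACTION_DIM = 7    # e.g. a 7-DoF arm action
EXIT_TOL = 1e-2   # inter-layer consistency threshold (illustrative)

def layer_features(feat, layer):
    """Stand-in for one VLM backbone layer; perturbations shrink with
    depth to mimic feature saturation in deep layers."""
    return feat + rng.normal(scale=1.0 / (layer + 1), size=feat.shape)

def action_head(feat, a_init, steps):
    """Toy flow-matching head: Euler-integrate a straight-line velocity
    field from a_init toward a feature-implied target in `steps` steps."""
    target = np.tanh(feat[:ACTION_DIM])  # hypothetical feature->action map
    a = a_init.copy()
    for _ in range(steps):
        a += (target - a) / steps
    return a

def adaptive_inference(obs, warm_steps=3):
    """Budget-aware inference: warm-start denoising at each layer and
    exit early when consecutive layer actions agree within EXIT_TOL."""
    feat = obs
    a = np.zeros(ACTION_DIM)  # initial action estimate
    used_layers = NUM_LAYERS
    for layer in range(NUM_LAYERS):
        feat = layer_features(feat, layer)
        # Warm-start: reuse the previous layer's action, few denoise steps.
        a_new = action_head(feat, a, warm_steps)
        # Inter-layer action consistency triggers early termination.
        if np.linalg.norm(a_new - a) < EXIT_TOL:
            used_layers = layer + 1
            a = a_new
            break
        a = a_new
    return a, used_layers

action, layers_used = adaptive_inference(rng.normal(size=64))
```

The latency saving comes from two places: skipped backbone layers after the exit, and the short warm-started integration replacing a full denoising schedule at every layer.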
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
real-time control
inference cost
robot manipulation
commodity hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive inference
truncated flow matching
vision-language-action model
budget-aware acceleration
open-source robotics