AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

๐Ÿ“… 2026-03-10
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses key limitations of existing vision-language-action (VLA) models: the lack of long-term contextual awareness, temporal inconsistency in generated action sequences, and the mismatch between inference and control frequencies. To overcome these challenges, the authors propose a standalone autoregressive action expert that generates actions as a context-aware, continuous, causal sequence, conditioning on a refreshable vision-language prefix while maintaining its own long-term memory. The approach introduces the first truly autoregressive action policy, featuring a re-anchoring mechanism that synchronizes asynchronous multimodal inputs and a modular design that supports independent pretraining and integration with heavyweight perception backbones. Experiments show that the model produces smoother action trajectories with stronger historical awareness, achieving task success rates on par with or exceeding state-of-the-art reactive VLA models in both simulated and real-world robotic tasks.
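The summary's core structural idea, a fast autoregressive action loop whose vision-language prefix is refreshed at a slower perception rate while the action memory is never reset, can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation; all class and method names (`ARActionExpert`, `refresh_prefix`, `step`) and the placeholder mixing weights are assumptions for illustration only.

```python
import numpy as np

class ARActionExpert:
    """Toy sketch of an autoregressive action expert (illustrative only).

    Two decoupled timescales, mirroring the paper's described structure:
      * slow path  - a heavy perception backbone refreshes the VL prefix;
      * fast path  - actions are generated one causal step at a time,
        conditioned on the prefix AND a long-lived action memory that
        survives prefix refreshes (unlike reactive chunk-based heads,
        which reset temporal context with each new observation).
    """

    def __init__(self, action_dim=2, memory_len=16):
        self.action_dim = action_dim
        self.memory_len = memory_len
        self.memory = []                               # long-lived, never reset
        self.vl_prefix = np.zeros(action_dim)          # refreshable conditioning

    def refresh_prefix(self, vl_features):
        # Slow path: only the prefix is overwritten; memory is kept intact.
        self.vl_prefix = np.asarray(vl_features, dtype=float)

    def step(self):
        # Fast path: one causal action step. The history summary stands in
        # for whatever learned memory readout the real model would use.
        if self.memory:
            history = np.mean(self.memory, axis=0)
        else:
            history = np.zeros(self.action_dim)
        action = 0.7 * self.vl_prefix + 0.3 * history  # placeholder dynamics
        self.memory.append(action)
        self.memory = self.memory[-self.memory_len:]
        return action
```

Running several `step()` calls per `refresh_prefix()` call reproduces the inference-control frequency mismatch the summary describes: control ticks fast while perception updates slowly, and the action history bridges the gap.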

๐Ÿ“ Abstract
We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.
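The abstract's re-anchoring mechanism, which accounts for perception staleness when synchronizing asynchronous V-L-A modalities, can be illustrated with a minimal sketch. This is an assumed simplification, not the paper's actual formulation: it only models staleness as dropping the control steps that elapsed while the observation was in flight, and the function name and interface are hypothetical.

```python
def re_anchor(action_sequence, obs_time, current_time, control_dt):
    """Illustrative re-anchoring under an assumed timing model.

    A slow perception backbone emits an observation captured at
    `obs_time`; by `current_time` that observation is stale. If
    `action_sequence` holds actions timestamped from `obs_time` at
    intervals of `control_dt`, we discard the steps that already lie
    in the past, so the remaining actions align with the robot's
    current state rather than the stale state perception saw.
    """
    staleness = current_time - obs_time
    # Number of control steps that elapsed while perception was running.
    skip = int(staleness // control_dt)
    return action_sequence[skip:]
```

For example, with a 10 Hz control loop (`control_dt=0.1`) and an observation that took 0.25 s to process, the first two planned actions are already obsolete and are dropped before execution.
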
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
autoregressive action generation
temporal context
history awareness
spatio-temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

autoregressive action generation
context-aware policy
vision-language-action models
temporal consistency
re-anchoring mechanism
๐Ÿ”Ž Similar Papers
2024-03-04Computer Vision and Pattern RecognitionCitations: 3