🤖 AI Summary
This work addresses key limitations in existing vision-language-action (VLA) models, namely the lack of long-term contextual awareness, temporal inconsistency in action sequences, and mismatched inference and control frequencies. To overcome these challenges, the authors propose a standalone autoregressive action expert module that enables context-aware, continuous, and causal action generation through a refreshable vision-language prefix conditioning mechanism and a long-term memory architecture. This approach introduces the first truly autoregressive action policy, featuring a re-anchoring mechanism to synchronize asynchronous multimodal inputs and allowing modular integration with heavyweight perception backbones alongside independent pretraining. Experiments demonstrate that the model produces smoother action trajectories with enhanced historical awareness, achieving task success rates on par with or exceeding those of state-of-the-art reactive VLA models in both simulated and real-world robotic tasks.
📝 Abstract
We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies, which reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enables efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, and naturally ensures spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we employ a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.
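The control-loop structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; all function names, the refresh period, and the toy policy are hypothetical. It shows the key assumptions: a slow perception backbone that periodically refreshes a vision-language prefix, a fast action expert that decodes one action per control tick from its own never-reset history, and a re-anchoring term that accounts for how stale the current prefix is.

```python
# Illustrative sketch only (all names hypothetical): a fast AR action expert
# conditioned on a slowly refreshed vision-language prefix, with a simple
# staleness correction standing in for the paper's re-anchoring mechanism.

from collections import deque

PREFIX_REFRESH_PERIOD = 5  # slow perception runs once per 5 control ticks


def perceive(t):
    """Stand-in for a heavy vision-language backbone producing a prefix."""
    return {"timestamp": t, "target": float(t)}


def action_expert(prefix, history, t):
    """Toy causal policy: the next action depends on the expert's own
    long-lived history and on the prefix, offset by the prefix's age."""
    staleness = t - prefix["timestamp"]  # re-anchoring term (hypothetical form)
    prev = history[-1] if history else 0.0
    # continuous, causal update toward a staleness-corrected target
    return 0.5 * prev + 0.5 * (prefix["target"] + staleness)


def run(num_ticks=12):
    history = deque(maxlen=64)  # action expert's memory: never reset
    prefix = perceive(0)
    trajectory = []
    for t in range(num_ticks):
        if t % PREFIX_REFRESH_PERIOD == 0:
            prefix = perceive(t)  # refresh the VL prefix, keep the memory
        action = action_expert(prefix, history, t)
        history.append(action)
        trajectory.append(action)
    return trajectory


traj = run()
```

Note the contrast with a reactive chunk-based head, which would clear `history` on every new observation; here only the prefix is replaced, so the action sequence stays temporally continuous across perception updates.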