🤖 AI Summary
Existing Transformer-based imitation learning methods rely on action discretization (e.g., vector quantization), which breaks the continuous geometric structure of the action space and limits model expressivity and policy smoothness. To address this, this work adopts the Generative Infinite-Vocabulary Transformer (GIVT) as a direct, continuous policy parametrization for autoregressive transformers, operating in the original continuous action space and eliminating the need for action quantization. The approach pairs continuous-domain autoregressive modeling with a carefully studied sampling strategy (including temperature scaling), thereby preserving the continuous structure of the action distribution. Evaluated on multiple standard simulated robotic manipulation tasks, the method significantly outperforms quantized baselines, achieving state-of-the-art performance while improving the temporal smoothness of generated actions.
📝 Abstract
Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent codes. However, the initial quantization breaks the continuous structure of the action space, thereby limiting the capabilities of the generative model. We instead propose a quantization-free method that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We further improve policy roll-outs by carefully studying sampling algorithms.
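To make the continuous-policy idea concrete, here is a minimal, hypothetical sketch of temperature-scaled sampling from a Gaussian-mixture policy head (the kind of continuous output distribution a GIVT-style decoder predicts). All names and parameters below are illustrative assumptions, not the paper's actual implementation; the two temperatures separately control mixture-weight sharpness and per-component noise, which is one simple way to trade diversity for smoother roll-outs.

```python
import numpy as np

def sample_gmm_action(logits, means, stds, temp_mix=1.0, temp_std=1.0, rng=None):
    """Sample one scalar action from a Gaussian-mixture policy head.

    logits, means, stds: arrays of shape (K,) for K mixture components.
    temp_mix < 1 sharpens the mixture weights; temp_std < 1 shrinks
    per-component noise. Both names are illustrative, not from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Temperature-scaled, numerically stable softmax over mixture logits.
    z = logits / temp_mix
    weights = np.exp(z - z.max())
    weights /= weights.sum()
    # Pick a component, then sample from its (scaled) Gaussian.
    k = rng.choice(len(weights), p=weights)
    return rng.normal(means[k], stds[k] * temp_std)
```

With `temp_std=0.0` the sampler collapses to the selected component's mean, giving a deterministic greedy-style action; intermediate values interpolate between greedy and fully stochastic roll-outs.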