🤖 AI Summary
In in-context imitation learning (ICIL), existing action tokenizers encode action trajectories effectively but fail to preserve temporal smoothness, which leads to unstable robot execution. This work systematically evaluates existing tokenizers in ICIL, identifies this limitation, and proposes LipVQ-VAE, a vector-quantized variational autoencoder that enforces a Lipschitz condition in the latent action space via weight normalization, so that the smoothness of the raw actions carries over to the quantized latent codebook. In high-fidelity simulation, LipVQ-VAE improves ICIL performance by more than 5.3%. Real-robot experiments further show smoother, more reliable trajectories. The method narrows the gap between discrete action tokenization and the continuous control that robots require.
📝 Abstract
In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is the key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely unexplored in ICIL. In this work, we first systematically evaluate existing action tokenization methods in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from raw action inputs to a quantized latent codebook, LipVQ-VAE generates more stable and smoother actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, with real-world experiments confirming its ability to produce smoother, more reliable trajectories. Code and checkpoints will be released.
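To make the idea of a Lipschitz condition enforced via weight normalization concrete, here is a minimal NumPy sketch of one common way to bound a linear layer. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 7-dimensional action, the 4-dimensional latent, the softplus-parameterized bound, and the choice of the infinity norm are all hypothetical. The sketch rescales each weight row so that the layer's infinity-norm Lipschitz constant cannot exceed the bound, then checks that consecutive steps of a smooth action trajectory stay close after encoding.

```python
import numpy as np


def softplus(x):
    """Numerically stable softplus; keeps the Lipschitz bound positive."""
    return np.logaddexp(0.0, x)


def lipschitz_normalize(W, c):
    """Rescale rows of W so that max_i sum_j |W_ij| <= softplus(c).

    The maximum absolute row sum is the operator norm induced by the
    infinity norm, so it bounds the Lipschitz constant of x -> W x.
    """
    bound = softplus(c)
    row_sums = np.abs(W).sum(axis=1, keepdims=True)
    scale = np.minimum(1.0, bound / np.maximum(row_sums, 1e-12))
    return W * scale


rng = np.random.default_rng(0)

# Hypothetical sizes: 7-dimensional action -> 4-dimensional latent.
W = rng.normal(size=(4, 7))
b = np.zeros(4)
c = 0.0  # assumed (normally learnable) parameter; bound = softplus(0) = ln 2
bound = softplus(c)
W_hat = lipschitz_normalize(W, c)


def encode(actions):
    """Lipschitz-bounded linear encoder applied to a (T, 7) trajectory."""
    return actions @ W_hat.T + b


# A smooth synthetic trajectory: a random walk with small increments.
actions = np.cumsum(0.01 * rng.normal(size=(50, 7)), axis=0)
latents = encode(actions)

# Largest per-step change (infinity norm) before and after encoding.
action_steps = np.abs(np.diff(actions, axis=0)).max(axis=1)
latent_steps = np.abs(np.diff(latents, axis=0)).max(axis=1)
smooth = bool(np.all(latent_steps <= bound * action_steps + 1e-12))
```

In this sketch `smooth` records whether every latent step is at most `bound` times the corresponding action step. The guarantee covers only the continuous layer; a nearest-codebook lookup applied afterwards is a separate step that this example does not model.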