AI Summary
To address the scarcity and high cost of expert action labels in imitation learning, this paper proposes a continuous latent action representation learning framework that requires no expert action annotations. Methodologically, it employs a variational autoencoder to perform unsupervised latent space modeling over unlabeled demonstration videos and introduces an end-to-end jointly optimized action decoder to automatically infer latent action sequences. Furthermore, a policy distillation mechanism is incorporated to ground the learned latent space in physical actions using only a small number of real action labels, which need not come from expert demonstrations. Experiments on DMControl, MetaWorld, and a real-world WidowX robotic arm show that the approach significantly outperforms state-of-the-art methods, with a 2-3x improvement in task success rate over the best baseline. These results support its generalization capability and practical efficacy when no action-labeled expert data is available.
Abstract
Learning robot policies with imitation learning requires collecting large amounts of costly action-labeled expert demonstrations, which fundamentally limits the scale of training data. A promising approach to this bottleneck is to harness the abundance of unlabeled observations (e.g., from video demonstrations) to learn latent action labels in an unsupervised way. However, we find that existing methods struggle when applied to complex robot tasks requiring fine-grained motions. We design continuous latent action models (CLAM), which incorporate two key ingredients we find necessary for learning to solve complex continuous control tasks from unlabeled observation data: (a) using continuous latent action labels instead of discrete representations, and (b) jointly training an action decoder to ensure that the latent action space can be easily grounded to real actions with relatively few labeled examples. Importantly, the labeled examples can be collected from non-optimal play data, enabling CLAM to learn performant policies without access to any action-labeled expert data. We demonstrate on continuous control benchmarks in DMControl (locomotion) and MetaWorld (manipulation), as well as on a real WidowX robot arm, that CLAM significantly outperforms prior state-of-the-art methods, with a 2-3x improvement in task success rate compared to the best baseline. Videos and code can be found at clamrobot.github.io.
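To make ingredients (a) and (b) concrete, here is a minimal sketch of what a joint objective of this kind could look like for a single transition. It is an illustration under stated assumptions, not the CLAM implementation: the encoder/forward-model split, the linear maps, the `beta` weight, and every function name below are hypothetical, and the variational (KL) term mentioned in the summary is omitted for brevity.

```python
import random

def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def make_params(obs_dim, latent_dim, action_dim, seed=0):
    """Randomly initialise three hypothetical linear modules."""
    rng = random.Random(seed)
    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "encoder": init(latent_dim, 2 * obs_dim),        # (o_t, o_t+1) -> z
        "forward": init(obs_dim, obs_dim + latent_dim),  # (o_t, z) -> o_t+1
        "decoder": init(action_dim, latent_dim),         # z -> real action
    }

def joint_loss(params, obs, next_obs, action=None, beta=1.0):
    """Loss for one transition; `action` is None for unlabeled data."""
    # (a) Infer a continuous latent action from consecutive observations.
    z = linear(params["encoder"], obs + next_obs)
    # Reconstruct the next observation from the current one and z.
    pred_next = linear(params["forward"], obs + z)
    loss = mse(pred_next, next_obs)
    # (b) On the few labeled transitions, also train the action decoder,
    # so the latent space is shaped to be decodable into real actions.
    if action is not None:
        loss += beta * mse(linear(params["decoder"], z), action)
    return loss

params = make_params(obs_dim=3, latent_dim=2, action_dim=2)
obs, next_obs = [0.1, 0.2, 0.3], [0.2, 0.1, 0.4]
unlabeled = joint_loss(params, obs, next_obs)
labeled = joint_loss(params, obs, next_obs, action=[1.0, -1.0])
print(unlabeled, labeled)
```

In this sketch, unlabeled transitions contribute only the reconstruction term, while labeled transitions (which, per the abstract, may come from non-optimal play data) add the decoder term. Because both terms depend on the same latent `z`, the decoder loss also influences the encoder during training.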