🤖 AI Summary
This work addresses two key challenges in language-conditioned multi-task imitation learning: policy-induced distribution shift and poor zero-shot transfer to real robots. We propose an end-to-end framework grounded in an offline world model, specifically a VAE-based dynamics model. Methodologically: (1) the policy is optimized on-policy through rollouts in the learned latent space; (2) a multi-step intrinsic reward is defined over latent trajectories; and (3) hindsight goal relabeling with both image and language goals is used during training, so that fewer than 1% of the unstructured play data needs language annotations. To our knowledge, this is the first approach to learn language-conditioned continuous visuomotor control for a real robot within an offline world model. On the CALVIN benchmark, the method outperforms comparable prior learning-based approaches on chained multi-task evaluations, and it executes long-horizon, chained language instructions zero-shot on physical robot hardware.
📝 Abstract
We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model and transfers these skills zero-shot to a real robot. By learning on-policy in the latent space of the learned world model, our algorithm mitigates the policy-induced distribution shift from which most offline imitation learning methods suffer. LUMOS learns from unstructured play data with fewer than 1% hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long-horizon performance by combining latent planning with both image- and language-based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, effectively reducing covariate shift. In experiments on the difficult long-horizon CALVIN benchmark, LUMOS outperforms comparable prior learning-based methods on chained multi-task evaluations. To the best of our knowledge, we are the first to learn language-conditioned continuous visuomotor control for a real-world robot within an offline world model. Videos, dataset, and code are available at http://lumos.cs.uni-freiburg.de.
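The hindsight goal relabeling mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the LUMOS implementation: the names `Step`, `RelabeledWindow`, and `relabel_play`, the window length, and the use of strings as stand-ins for camera images are all assumptions made for illustration. The idea shown is only the generic one: every window cut from a play log is treated as a demonstration of reaching its own final observation, and a language goal is attached only to the few windows that have a hindsight annotation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Step:
    obs: str      # stand-in for an image observation (hypothetical)
    action: int   # stand-in for a continuous action (hypothetical)


@dataclass
class RelabeledWindow:
    steps: List[Step]
    goal_image: str            # last observation of the window, used as the goal
    goal_text: Optional[str]   # hindsight instruction, present only if annotated


def relabel_play(
    play: Sequence[Step],
    window: int,
    annotations: Dict[int, str],
) -> List[RelabeledWindow]:
    """Cut a play log into sliding windows and relabel each with hindsight goals.

    `annotations` maps a window's start index to a language instruction that
    describes, in hindsight, what happened in that window.
    """
    relabeled = []
    for start in range(len(play) - window + 1):
        chunk = list(play[start:start + window])
        relabeled.append(
            RelabeledWindow(
                steps=chunk,
                goal_image=chunk[-1].obs,            # image goal: where the window ended
                goal_text=annotations.get(start),    # language goal: None for most windows
            )
        )
    return relabeled


# Toy play log with a single annotated window (made-up data).
play = [Step(obs=f"frame_{t}", action=t % 3) for t in range(200)]
annotations = {40: "open the drawer"}
windows = relabel_play(play, window=32, annotations=annotations)

annotated = [w for w in windows if w.goal_text is not None]
annotated_fraction = len(annotated) / len(windows)
print(len(windows), len(annotated), round(annotated_fraction, 4))
```

In this toy setting, one of the 169 windows carries a language goal, which is below 1% and mirrors the annotation budget described above. Every window still has an image goal, so a goal-conditioned policy can be trained on all of them and conditioned on language only where text is available.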