LUMOS: Language-Conditioned Imitation Learning with World Models

📅 2025-03-13
🤖 AI Summary
This work addresses two key challenges in language-conditioned multi-task imitation learning: policy-induced distribution shift and poor zero-shot transfer to real robots. We propose an end-to-end framework grounded in an offline world model, specifically a VAE-based dynamics model. Methodologically: (1) the policy is optimized on-policy through rollouts in the learned latent space; (2) an intrinsic reward is defined over latent trajectories and optimized across multiple time steps; and (3) hindsight goal relabeling uses both image and language goals, so that fewer than 1% of the unstructured play data needs language annotations. To our knowledge, this is the first approach to learn language-conditioned continuous visuomotor control for a real robot within an offline world model. Our method outperforms prior learning-based methods on chained multi-task evaluations in the CALVIN benchmark and executes long-horizon, chained language instructions zero-shot on physical robot hardware.
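The relabeling step in (3) can be illustrated with a minimal sketch. It is not the authors' implementation: the function `relabel_windows`, the `RelabeledWindow` record, the fixed window length, and the choice of the window's final frame as the image goal are all assumptions made for illustration. Frames are represented only by their indices.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RelabeledWindow:
    start: int                 # index of the first step in the window
    end: int                   # index of the last step in the window (inclusive)
    goal_image: int            # frame index used as the image goal
    goal_text: Optional[str]   # language goal, present only for annotated windows


def relabel_windows(
    num_steps: int,
    window: int,
    annotations: Dict[Tuple[int, int], str],
    num_samples: int,
    seed: int = 0,
) -> List[RelabeledWindow]:
    """Turn unlabeled play data into goal-conditioned training windows.

    Assumption: whatever state a window ends in is treated, in hindsight,
    as the goal that window was trying to reach.
    """
    rng = random.Random(seed)
    windows: List[RelabeledWindow] = []

    # The small annotated fraction keeps its language instruction as the goal.
    for (start, end), text in sorted(annotations.items()):
        windows.append(RelabeledWindow(start, end, goal_image=end, goal_text=text))

    # Everything else is relabeled with an image goal only.
    for _ in range(num_samples):
        start = rng.randrange(0, num_steps - window + 1)
        end = start + window - 1
        windows.append(RelabeledWindow(start, end, goal_image=end, goal_text=None))

    return windows


if __name__ == "__main__":
    # Hypothetical episode: 1000 steps, one annotated window among 100 (1%).
    data = relabel_windows(1000, 32, {(100, 131): "open the drawer"}, num_samples=99)
    annotated = sum(w.goal_text is not None for w in data)
    print(f"{len(data)} windows, {annotated} with a language goal")
```

In this toy setup, one window in a hundred carries a language goal and the rest carry image goals only, which mirrors the "fewer than 1%" annotation budget in spirit.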

📝 Abstract
We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model and transfers these skills zero-shot to a real robot. By learning on-policy in the latent space of the learned world model, our algorithm mitigates the policy-induced distribution shift from which most offline imitation learning methods suffer. LUMOS learns from unstructured play data with fewer than 1% hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long-horizon performance by combining latent planning with both image- and language-based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, which reduces covariate shift. In experiments on the difficult long-horizon CALVIN benchmark, LUMOS outperforms comparable prior learning-based methods on chained multi-task evaluations. To the best of our knowledge, we are the first to learn language-conditioned continuous visuomotor control for a real-world robot within an offline world model. Videos, dataset, and code are available at http://lumos.cs.uni-freiburg.de.
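To make the idea of an intrinsic reward "defined in the latent space of the world model over multiple time steps" more concrete, here is a minimal sketch. The abstract does not state the reward's functional form, so the per-step cosine similarity between the policy's imagined latent states and the demonstration's latent states is an assumption, as are the function names and the discount factor.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def latent_matching_rewards(
    policy_latents: Sequence[Vector], expert_latents: Sequence[Vector]
) -> List[float]:
    """One reward per time step: how closely the imagined latent state
    matches the demonstration's latent state at the same step."""
    if len(policy_latents) != len(expert_latents):
        raise ValueError("trajectories must have the same length")
    return [cosine_similarity(p, e) for p, e in zip(policy_latents, expert_latents)]


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Sum the per-step rewards over the whole horizon with discounting."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    expert = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    drifting = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # policy stops tracking the expert
    print(discounted_return(latent_matching_rewards(expert, expert)))
    print(discounted_return(latent_matching_rewards(drifting, expert)))
```

Because the objective sums over the whole horizon, a policy that drifts away from the demonstration after the first step scores lower than one that keeps tracking it, which is the intuition behind using a multi-step objective to counter covariate shift.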

Problem

Research questions and friction points this paper is trying to address.

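Learning language-conditioned multi-task skills for robots from unstructured play data with fewer than 1% language annotations.
Mitigating the policy-induced distribution shift that affects most offline imitation learning methods.
Transferring the learned skills zero-shot to a real robot.

Innovation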

Methods, ideas, or system contributions that make the work stand out.

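Language-conditioned multi-task imitation learning from play data, using image- and language-based hindsight goal relabeling
Latent-space planning with a learned world model, optimizing a multi-step intrinsic reward on-policy
Zero-shot transfer of the learned skills to a real robot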
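
The latent-space practice named in this list can be pictured as an "imagined" rollout: the policy acts, the world model predicts the next latent state, and no real environment is touched. The sketch below is a toy under stated assumptions. `imagine_rollout`, `toy_policy`, and `toy_dynamics` are hypothetical names, the goal is a plain latent target rather than an encoded image or instruction, and the hand-written functions stand in for learned networks.

```python
from typing import Callable, List, Tuple

Latent = Tuple[float, ...]
Action = Tuple[float, ...]


def imagine_rollout(
    z0: Latent,
    goal: Latent,
    policy: Callable[[Latent, Latent], Action],
    dynamics: Callable[[Latent, Action], Latent],
    horizon: int,
) -> Tuple[List[Latent], List[Action]]:
    """Roll the policy forward inside the (stand-in) world model.

    The states visited are the ones the current policy itself induces,
    which is what makes training on them on-policy.
    """
    latents: List[Latent] = [z0]
    actions: List[Action] = []
    z = z0
    for _ in range(horizon):
        a = policy(z, goal)   # goal-conditioned action
        z = dynamics(z, a)    # predicted next latent state
        actions.append(a)
        latents.append(z)
    return latents, actions


def toy_dynamics(z: Latent, a: Action) -> Latent:
    """Hypothetical dynamics: the action is added to the latent state."""
    return tuple(zi + ai for zi, ai in zip(z, a))


def toy_policy(z: Latent, goal: Latent) -> Action:
    """Hypothetical policy: move halfway toward the goal at each step."""
    return tuple(0.5 * (gi - zi) for zi, gi in zip(z, goal))


if __name__ == "__main__":
    latents, _ = imagine_rollout((0.0, 0.0), (1.0, -1.0), toy_policy, toy_dynamics, 3)
    for step, z in enumerate(latents):
        print(step, z)
```

In a real system the rollout would start from latent states encoded from play data, and the per-step latents would feed a reward such as the latent-matching objective the abstract describes.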