🤖 AI Summary
This work addresses the scarcity of action labels in world model training by proposing the Latent-Action World Model (LAWM), which jointly leverages a small set of action-labeled interaction trajectories and abundant unlabeled passive observations (e.g., videos). Methodologically, LAWM infers latent actions from passive observation sequences, maps explicit control actions into the same latent action space, and aligns the two so that a single latent dynamics model can train on both data sources. In doing so, it bridges two traditionally separate regimes, offline reinforcement learning on action-conditioned data and training on purely passive data, extending the data sources and generalization capability of world models under extreme action-label scarcity. Evaluated on the DeepMind Control Suite, LAWM achieves performance comparable to fully supervised baselines while using only about 10% of the action annotations, a substantial improvement in data efficiency.
📝 Abstract
Inspired by how humans combine direct interaction with action-free experience (e.g., videos), we study world models that learn from heterogeneous data. Standard world models typically rely on action-conditioned trajectories, which limits their effectiveness when action labels are scarce. We introduce a family of latent-action world models that jointly use action-conditioned and action-free data by learning a shared latent action representation. This latent space aligns observed control signals with actions inferred from passive observations, enabling a single dynamics model to train on large-scale unlabeled trajectories while requiring only a small set of action-labeled ones. We use the latent-action world model to learn a latent-action policy through offline reinforcement learning (RL), thereby bridging two traditionally separate domains: offline RL, which typically relies on action-conditioned data, and action-free training, which is rarely followed by RL. On the DeepMind Control Suite, our approach achieves strong performance while using about an order of magnitude fewer action-labeled samples than purely action-conditioned baselines. These results show that latent actions enable training on both passive and interactive data, allowing world models to learn more efficiently.
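The core data flow described above can be sketched in a few lines: an action encoder and an inverse-dynamics-style model both map into a shared latent action space, an alignment loss ties them together on the small labeled set, and one dynamics model consumes latent actions from either source. The sketch below is a minimal, purely illustrative assumption of that setup; the linear maps and names (`encode_action`, `infer_latent_action`, `predict_next_obs`) and all dimensions are hypothetical, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
obs_dim, act_dim, latent_act_dim = 8, 2, 4

# Two routes into one shared latent action space:
#  - an action encoder g(a) for the small action-labeled set,
#  - an inverse-dynamics model h(o_t, o_{t+1}) for abundant action-free data.
W_act = rng.normal(scale=0.1, size=(act_dim, latent_act_dim))
W_inv = rng.normal(scale=0.1, size=(2 * obs_dim, latent_act_dim))
W_dyn = rng.normal(scale=0.1, size=(obs_dim + latent_act_dim, obs_dim))

def encode_action(a):
    # Map an explicit control action into the latent action space.
    return a @ W_act

def infer_latent_action(o_t, o_next):
    # Infer a latent action from a pair of consecutive observations.
    return np.concatenate([o_t, o_next], axis=-1) @ W_inv

def predict_next_obs(o_t, z):
    # Single shared dynamics model, conditioned on a latent action.
    return np.concatenate([o_t, z], axis=-1) @ W_dyn

# Labeled batch (o_t, a_t, o_{t+1}) is small; unlabeled batch has no actions.
o, o_next = rng.normal(size=(16, obs_dim)), rng.normal(size=(16, obs_dim))
a = rng.normal(size=(16, act_dim))
u, u_next = rng.normal(size=(64, obs_dim)), rng.normal(size=(64, obs_dim))

# Alignment loss: on labeled data, the encoded action and the latent action
# inferred from observations should agree.
z_labeled = encode_action(a)
z_inferred = infer_latent_action(o, o_next)
align_loss = np.mean((z_labeled - z_inferred) ** 2)

# The shared dynamics model trains on both streams via latent actions.
dyn_loss_labeled = np.mean((predict_next_obs(o, z_labeled) - o_next) ** 2)
dyn_loss_unlabeled = np.mean(
    (predict_next_obs(u, infer_latent_action(u, u_next)) - u_next) ** 2
)
total_loss = align_loss + dyn_loss_labeled + dyn_loss_unlabeled
```

Note the asymmetry in batch sizes (16 labeled vs. 64 unlabeled pairs): only the alignment term needs action labels, while the dynamics terms train on any observation sequence, which is what lets the unlabeled stream dominate training.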