AI Summary
Existing two-stage approaches train latent action models (LAMs) and world models separately, leading to representation inconsistency, redundant optimization, and limited co-adaptation. To address this, we propose CoLA-World, the first framework enabling end-to-end joint training of LAMs with pre-trained world models. CoLA-World introduces a warm-up phase to prevent representation collapse and incorporates representation alignment and progressive optimization mechanisms to establish a bidirectional, mutually reinforcing co-evolutionary loop. By unifying video generation modeling with dynamical system modeling, our method enhances the world model's generality and out-of-distribution generalization. Experiments demonstrate that CoLA-World matches or surpasses conventional two-stage paradigms in both video simulation fidelity and downstream visual planning performance, validating the effectiveness and robustness of the co-evolutionary modeling paradigm.
Abstract
Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamic model in the LAM with a powerful world model and train the two jointly, but doing so is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm. It resolves the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.