Co-Evolving Latent Action World Models

📅 2025-10-30
🤖 AI Summary
Existing two-stage approaches train latent action models (LAMs) and world models separately, which leads to inconsistent representations, redundant optimization, and limited co-adaptation. To address this, we propose CoLA-World, the first framework that trains a LAM end-to-end jointly with a pre-trained world model. CoLA-World introduces a warm-up phase that prevents representational collapse by aligning the LAM's representations with those of the world model, which in turn establishes a bidirectional, mutually reinforcing co-evolution loop. By unifying video generation modeling with dynamical system modeling, the method improves the world model's generality and out-of-distribution generalization. Experiments show that CoLA-World matches or outperforms conventional two-stage methods in both video simulation fidelity and downstream visual planning performance, supporting the effectiveness and robustness of the co-evolutionary paradigm.

📝 Abstract
Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train the two jointly, but doing so is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
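To make the idea of swapping the LAM's forward dynamics model for a world model more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. Frames are short lists of floats, and the names `infer_latent_action`, `small_forward_model`, `pretrained_world_model`, and `reconstruction_loss` are hypothetical, as is the factor 0.5 used to mimic a world model that expects latent actions on a different scale than a from-scratch LAM produces.

```python
from typing import Callable, List

Frame = List[float]
# A predictor maps (current frame, latent action) to a predicted next frame.
Predictor = Callable[[Frame, Frame], Frame]


def infer_latent_action(frame: Frame, next_frame: Frame, scale: float) -> Frame:
    """Toy inverse dynamics: the latent action is a scaled frame difference."""
    return [scale * (n - c) for c, n in zip(frame, next_frame)]


def small_forward_model(frame: Frame, action: Frame) -> Frame:
    """Stand-in for the lightweight forward model of a stand-alone LAM."""
    return [c + a for c, a in zip(frame, action)]


def pretrained_world_model(frame: Frame, action: Frame) -> Frame:
    """Stand-in for a pre-trained world model (assumed action gain of 0.5)."""
    return [c + 0.5 * a for c, a in zip(frame, action)]


def reconstruction_loss(predictor: Predictor, frame: Frame,
                        next_frame: Frame, scale: float) -> float:
    """Mean squared error between the predicted and the true next frame."""
    action = infer_latent_action(frame, next_frame, scale)
    predicted = predictor(frame, action)
    return sum((p - n) ** 2 for p, n in zip(predicted, next_frame)) / len(next_frame)


frame, next_frame = [0.0, 1.0], [1.0, 3.0]
# Stand-alone LAM: inverse and forward models agree on the action scale.
print(reconstruction_loss(small_forward_model, frame, next_frame, scale=1.0))
# Naive swap: the same latent actions no longer match what the predictor expects.
print(reconstruction_loss(pretrained_world_model, frame, next_frame, scale=1.0))
# After the latent scale is adapted to the world model, the mismatch disappears.
print(reconstruction_loss(pretrained_world_model, frame, next_frame, scale=2.0))
```

In this toy, the mismatch after the naive swap is a crude stand-in for the representation gap between a from-scratch LAM and a pre-trained world model. The collapse the abstract describes is a richer phenomenon than a scale error.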
Problem

Research questions and friction points this paper is trying to address.

Training latent action and world models separately causes redundant training
Joint training risks representational collapse without proper alignment
Aligning representations enables synergistic co-evolution between models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-evolves latent action models with world models
Uses warm-up phase to align representations
Enables joint training to avoid redundant stages
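The warm-up idea listed above can be sketched as a two-phase training schedule. The source says only that a warm-up phase aligns the from-scratch LAM with the pre-trained world model, so the details here are assumptions: the sketch freezes the world model during warm-up and updates only the LAM, then updates both. The scalar parameters `a` (LAM) and `b` (world model), the functions `loss_and_grads` and `train`, and all hyperparameters are hypothetical.

```python
def loss_and_grads(a, b, transitions):
    """Mean squared next-frame error and its gradients w.r.t. a and b."""
    loss = grad_a = grad_b = 0.0
    for o_t, o_next in transitions:
        d = o_next - o_t
        z = a * d              # latent action inferred by the toy LAM
        pred = o_t + b * z     # next-frame prediction of the toy world model
        err = pred - o_next
        loss += err * err
        grad_a += 2.0 * err * b * d   # gradient reaches the LAM through the world model
        grad_b += 2.0 * err * z
    n = len(transitions)
    return loss / n, grad_a / n, grad_b / n


def train(transitions, warmup_steps=50, total_steps=200, lr=0.05):
    """Gradient descent with an assumed warm-up phase that freezes the world model."""
    a, b = 0.1, 1.0   # a: from-scratch LAM weight, b: "pre-trained" world model weight
    history = []
    for step in range(total_steps):
        loss, grad_a, grad_b = loss_and_grads(a, b, transitions)
        history.append(loss)
        a -= lr * grad_a              # the LAM is always trained
        if step >= warmup_steps:      # the world model joins only after warm-up
            b -= lr * grad_b
    return a, b, history


transitions = [(0.0, 1.0), (1.0, 0.5), (0.5, 2.0)]
a, b, history = train(transitions)
print(f"first loss {history[0]:.4f}, last loss {history[-1]:.2e}, a*b = {a * b:.4f}")
```

The point of the schedule in this toy is that the pre-trained weight `b` is left untouched while the randomly initialized `a` is still far from compatible with it. Only once the two agree does joint updating begin.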
Authors

Yucen Wang, Ph.D. student at Nanjing University (Reinforcement Learning, World Models)
Fengming Zhang, Nanjing University
De-Chuan Zhan, Nanjing University, China (Machine Learning, Data Mining)
Li Zhao, Microsoft Research Asia
Kaixin Wang, Microsoft Research Asia
Jiang Bian, Microsoft Research Asia