Co-Evolving Latent Action World Models

📅 2025-10-30

🤖 AI Summary

Existing two-stage approaches train latent action models (LAMs) and world models separately, which leads to representation inconsistency, redundant optimization, and limited co-adaptation. To address this, the authors propose CoLA-World, the first framework to enable end-to-end joint training of a LAM with a pre-trained world model. CoLA-World introduces a warm-up phase to prevent representation collapse and incorporates representation alignment and progressive optimization mechanisms to establish a bidirectional, mutually reinforcing co-evolutionary loop. By unifying video generation modeling with dynamical system modeling, the method aims to improve the world model's generality and out-of-distribution generalization. Experiments show that CoLA-World matches or outperforms conventional two-stage paradigms in both video simulation fidelity and downstream visual planning performance, supporting the effectiveness and robustness of the co-evolutionary modeling paradigm.

📝 Abstract

Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train them jointly, but this is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
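The warm-up-then-joint schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scalar "models", the hand-written gradients, the helper names (trainable_parts, train), the hyperparameters, and in particular the choice to freeze the world model during warm-up are all assumptions made for illustration. The sketch only shows the general shape of the idea: a from-scratch LAM first learns through gradients that pass through a fixed, pre-trained world model, and only afterwards are both updated together.

```python
# Toy sketch only: scalar stand-ins with hand-written gradients.
# Assumption (not stated in the abstract): the world model is frozen during warm-up.

def trainable_parts(step, warmup_steps):
    """Return which parameter groups receive updates at this step."""
    if step < warmup_steps:
        return ("lam",)            # warm-up: only the from-scratch LAM learns
    return ("lam", "world_model")  # afterwards: both are updated jointly


def train(transitions, warmup_steps=50, total_steps=200, lr=0.05):
    w_lam = 0.0  # from-scratch LAM: latent action z = w_lam * (o_next - o)
    w_wm = 0.5   # stand-in for a pre-trained world model: o_pred = o + w_wm * z
    history = []
    for step in range(total_steps):
        o, o_next = transitions[step % len(transitions)]
        delta = o_next - o
        z = w_lam * delta                    # inferred latent action
        err = (o + w_wm * z) - o_next        # next-observation prediction error
        grad_lam = 2.0 * err * w_wm * delta  # gradient reaches the LAM via the world model
        grad_wm = 2.0 * err * z
        parts = trainable_parts(step, warmup_steps)
        if "lam" in parts:
            w_lam -= lr * grad_lam
        if "world_model" in parts:
            w_wm -= lr * grad_wm
        history.append((step, parts, err * err))
    return w_lam, w_wm, history


if __name__ == "__main__":
    data = [(0.0, 1.0), (1.0, 0.5), (0.5, 1.5)]  # hypothetical (o, o_next) pairs
    w_lam, w_wm, history = train(data)
    print("first-step loss:", history[0][2], "last-step loss:", history[-1][2])
```

In this toy setting the world-model weight stays at its initial value for the whole warm-up, so the LAM adapts to the pre-trained model rather than the other way around. Once joint updates begin, both weights move together to reduce the remaining prediction error.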

Problem

Research questions and friction points this paper is trying to address.

Training latent action and world models separately causes redundant training

Joint training risks representational collapse without proper alignment

Synergistic co-evolution between the two models depends on first aligning their representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-evolves latent action models with world models

Uses warm-up phase to align representations

Enables joint training to avoid redundant stages
