LIVE: Long-horizon Interactive Video World Modeling

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of error accumulation in autoregressive video world models, which often leads to degraded performance and instability when generating future frames far beyond the training rollout length. To mitigate this issue, the authors propose LIVE, a diffusion-based model that explicitly suppresses error propagation through a cycle-consistency objective combining forward action-conditioned prediction and backward state reconstruction. This design enables high-quality interactive video generation without relying on teacher distillation. Coupled with a progressive curriculum training strategy, LIVE achieves state-of-the-art performance on long-horizon video generation benchmarks, demonstrating the ability to stably produce high-fidelity videos significantly longer than those seen during training.

📝 Abstract
Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide a unified view that encompasses different approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
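The forward-then-backward rollout described in the abstract can be sketched in miniature. The toy below is not the paper's method: it replaces the diffusion model with simple deterministic transition functions and uses a plain MSE in place of the diffusion loss, purely to illustrate the shape of a cycle-consistency objective (roll forward under actions from a ground-truth state, roll backward to reconstruct it, penalize the reconstruction error). All function and parameter names here are hypothetical.

```python
import numpy as np

def forward_step(state, action, W_f):
    # Toy action-conditioned forward dynamics (stand-in for the
    # diffusion world model's next-frame prediction).
    return np.tanh(W_f @ np.concatenate([state, action]))

def backward_step(state, action, W_b):
    # Toy reverse dynamics: reconstruct the previous state given
    # the current state and the action that led to it.
    return np.tanh(W_b @ np.concatenate([state, action]))

def cycle_consistency_loss(x0, actions, W_f, W_b):
    """Roll forward from ground-truth state x0 under a sequence of
    actions, then roll backward through the same actions in reverse,
    and measure how far the reconstruction drifts from x0."""
    x = x0
    for a in actions:              # forward action-conditioned rollout
        x = forward_step(x, a, W_f)
    for a in reversed(actions):    # backward state reconstruction
        x = backward_step(x, a, W_b)
    return float(np.mean((x - x0) ** 2))

rng = np.random.default_rng(0)
d_state, d_action, horizon = 8, 2, 5
W_f = 0.1 * rng.standard_normal((d_state, d_state + d_action))
W_b = 0.1 * rng.standard_normal((d_state, d_state + d_action))
x0 = rng.standard_normal(d_state)
actions = [rng.standard_normal(d_action) for _ in range(horizon)]

loss = cycle_consistency_loss(x0, actions, W_f, W_b)
```

Minimizing such a loss ties the end of a long rollout back to a ground-truth anchor, which is the intuition behind bounding error accumulation without a teacher model.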
Problem

Research questions and friction points this paper is trying to address.

long-horizon video prediction
error accumulation
video world models
autoregressive modeling
interactive video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

cycle-consistency
long-horizon video generation
error accumulation control
diffusion-based world modeling
teacher-free training