🤖 AI Summary
This work addresses the challenge of error accumulation in autoregressive video world models, which often leads to degraded performance and instability when generating future frames far beyond the training rollout length. To mitigate this issue, the authors propose LIVE, a diffusion-based model that explicitly suppresses error propagation through a cycle-consistency objective combining forward action-conditioned prediction and backward state reconstruction. This design enables high-quality interactive video generation without relying on teacher distillation. Coupled with a progressive curriculum training strategy, LIVE achieves state-of-the-art performance on long-horizon video generation benchmarks, demonstrating the ability to stably produce high-fidelity videos significantly longer than those seen during training.
📝 Abstract
Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide a unified view that encompasses different approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
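The cycle-consistency objective described above can be sketched in toy form: roll the model forward from ground-truth frames under a sequence of actions, then run a reverse process back to the start and penalize the reconstruction error. This is a minimal illustration, not the paper's implementation; `forward_step`, `backward_step`, and the simple MSE stand-in for the diffusion loss are all hypothetical placeholders.

```python
import numpy as np

def cycle_consistency_loss(forward_step, backward_step, frames0, actions):
    """Toy sketch of a forward-then-backward cycle loss.

    forward_step(state, action)  -> predicted next state (hypothetical model)
    backward_step(state, action) -> reconstructed previous state (hypothetical)
    frames0: ground-truth initial state (array)
    actions: sequence of action arrays
    """
    # Forward rollout: predict future states starting from ground truth.
    state = frames0
    for a in actions:
        state = forward_step(state, a)
    # Reverse generation: walk the rollout backward to reconstruct the start.
    recon = state
    for a in reversed(actions):
        recon = backward_step(recon, a)
    # Loss on the cycle's reconstructed terminal state (here, MSE as a
    # stand-in for the diffusion loss), bounding accumulated rollout error.
    return float(np.mean((recon - frames0) ** 2))

# Toy dynamics where the backward step exactly inverts the forward step,
# so the cycle closes and the loss is zero.
frames0 = np.zeros(4)
actions = [np.full(4, 0.5)] * 3
fwd = lambda s, a: s + a
bwd = lambda s, a: s - a
loss = cycle_consistency_loss(fwd, bwd, frames0, actions)  # -> 0.0
```

With an imperfect backward model the loss is positive, which is what gives the training signal an explicit handle on long-horizon error propagation.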