StableWorld: Towards Stable and Consistent Long Interactive Video Generation

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses error accumulation in interactive video generation during prolonged interactions, which often leads to spatial drift and scene collapse. The study presents the first systematic analysis of the root causes of this issue and introduces a model-agnostic dynamic frame eviction mechanism that mitigates error propagation at its source by continuously evaluating frames and filtering out geometrically inconsistent ones in real time. The method integrates seamlessly into prevailing interactive generation frameworks such as Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, and delivers consistent improvements across multiple models in long-term video stability, temporal coherence, and cross-scenario generalization, validating its broad applicability and effectiveness.

📝 Abstract
In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we first investigate the underlying causes of instability and identify that the major source of error accumulation originates within the same scene: generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, **StableWorld**, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld prevents cumulative drift at its source, yielding more stable and temporally consistent interactive generation. Promising results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.
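The abstract describes a conditioning buffer that is continuously filtered so geometrically inconsistent frames never propagate errors forward. The following is a minimal sketch of that idea only; the function names, the toy scalar "frames", the consistency metric, and the threshold are all illustrative assumptions, not the paper's actual algorithm or API.

```python
# Hedged sketch of a dynamic frame eviction loop: each generation step
# re-filters the conditioning buffer against a clean reference so that
# degraded frames are dropped at the source instead of compounding drift.
from collections import deque

def consistency_score(frame: float, reference: float) -> float:
    """Toy stand-in metric: higher means more consistent with the reference.
    A real system might compare estimated depth, camera pose, or features."""
    return -abs(frame - reference)

def evict_degraded(buffer, reference: float, threshold: float):
    """Retain only frames whose consistency exceeds the threshold."""
    return [f for f in buffer if consistency_score(f, reference) >= threshold]

def generate_long_video(initial_frame: float, steps: int,
                        window: int = 8, threshold: float = -0.5):
    """Autoregressive loop conditioned on a buffer that is re-filtered
    every step (hypothetical generator: buffer mean plus a drift term)."""
    buffer = deque([initial_frame], maxlen=window)
    frames = [initial_frame]
    for t in range(steps):
        drift = 0.1 * (t % 3)  # toy error accumulation
        new_frame = sum(buffer) / len(buffer) + drift
        buffer.append(new_frame)
        # Eviction: inconsistent frames never condition future steps.
        buffer = deque(evict_degraded(buffer, initial_frame, threshold),
                       maxlen=window)
        frames.append(new_frame)
    return frames
```

Because every retained frame stays within the consistency threshold of the clean reference, each new frame is conditioned only on a bounded-error buffer, which is the mechanism's stated way of stopping drift from compounding.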
Problem

Research questions and friction points this paper is trying to address.

interactive video generation
temporal consistency
stability
scene collapse
spatial drift
Innovation

Methods, ideas, or system contributions that make the work stand out.

StableWorld
temporal consistency
interactive video generation
dynamic frame eviction
error accumulation