🤖 AI Summary
This work addresses the degradation in generation quality in training-free long video synthesis caused by the mismatch between training and inference paradigms and insufficient long-term temporal coherence. To this end, the authors propose MIGA, a frame-level autoregressive framework that introduces a novel two-stage alignment mechanism to bridge the training-inference gap. MIGA further incorporates a dual consistency enhancement strategy comprising self-reflective correction and long-range low-noise frame guidance. Remarkably, without any additional training, MIGA achieves state-of-the-art performance on the VBench and NarrLV benchmarks and enables high-consistency video generation of unlimited length under constant memory constraints.
📝 Abstract
Training-free long video generation aims to enable foundation video generation models to produce longer videos without incurring significant computational overhead. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. First, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. Second, we introduce a dual consistency enhancement mechanism, in which a self-reflection approach corrects early high-noise frames and a long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation; together they improve temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.
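To make the constant-memory property of frame-level autoregressive generation concrete, the following is a minimal toy sketch of a FIFO-style frame queue. It is not the MIGA or FIFO-diffusion implementation: `toy_denoise_step`, `fifo_generate`, and `queue_len` are hypothetical names, frames are reduced to single floats, and the "denoiser" is a stand-in that simply shrinks a value toward zero. The sketch only illustrates that the queue holds frames at several different noise levels at once (the noise span that a model trained on a single shared noise level never sees) while its size stays fixed however many frames are emitted.

```python
from collections import deque
import random


def toy_denoise_step(value, level):
    """Hypothetical stand-in for one denoising step of a video model.

    Scales the toy "noise" value so that it reaches 0 when the level reaches 0.
    """
    return value * (level - 1) / level


def fifo_generate(num_frames, queue_len=4, seed=0):
    """Emit `num_frames` toy frames from a fixed-size queue (assumed setup)."""
    rng = random.Random(seed)
    # Each entry is (value, noise_level). The head is the least noisy frame,
    # the tail the noisiest, so one pass spans levels 1..queue_len.
    queue = deque((rng.gauss(0, 1), lvl) for lvl in range(1, queue_len + 1))
    outputs, peak_size = [], 0
    while len(outputs) < num_frames:
        # One pass denoises every frame by one step, each at its own level.
        queue = deque((toy_denoise_step(v, l), l - 1) for v, l in queue)
        peak_size = max(peak_size, len(queue))
        # The head frame is now fully denoised: dequeue it as output.
        value, level = queue.popleft()
        assert level == 0
        outputs.append(value)
        # Enqueue a fresh fully noisy frame at the tail.
        queue.append((rng.gauss(0, 1), queue_len))
    return outputs, peak_size


outputs, peak_size = fifo_generate(num_frames=10, queue_len=4)
print(len(outputs), peak_size)
```

In this toy setup the queue never grows beyond `queue_len` entries, which is the sense in which memory stays constant; the spread of noise levels inside a single pass is the quantity the abstract refers to as the noise span.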