🤖 AI Summary
To address the high computational cost and poor real-time interactivity of existing video generation models, this paper proposes autoregressive adversarial post-training (AAPT), which converts a pre-trained latent video diffusion model into a one-step autoregressive real-time video generator. Methodologically, it explores adversarial training as a paradigm for autoregressive video generation and trains the model in a student-forcing manner to reduce error accumulation in long videos; generates one latent frame per neural function evaluation (1NFE), enabling streaming output and interactive user control between frames; and uses an architecture designed for efficient one-step generation that fully utilizes the KV cache. Experiments demonstrate that an 8B-parameter model achieves real-time streaming generation at 736×416 resolution and 24 fps on a single H100 GPU; on 8×H100, it generates 1280×720 video at 24 fps for up to one minute (1440 frames).
📝 Abstract
Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates one latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner, which proves effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24 fps streaming video generation at 736×416 resolution on a single H100, or at 1280×720 on 8×H100, for up to a minute (1440 frames). Visit our research website at https://seaweed-apt.com/2
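The streaming loop described in the abstract can be illustrated with a toy sketch. This is not the paper's model: `ToyOneStepGenerator`, `stream`, and `get_control` are hypothetical names, the "network" is a few lines of arithmetic, and a running sum stands in for the KV cache. It only shows the control flow: one function evaluation per latent frame, cached context extended by one entry per step, a user control read before each frame, and the model's own output fed back as the next input (the rollout that student-forcing training also uses).

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Iterator, List

Latent = List[float]  # toy latent frame: a short vector of floats


@dataclass
class ToyOneStepGenerator:
    """Hypothetical stand-in for a one-step (1NFE) causal generator."""
    running_sum: List[float] = field(default_factory=list)  # toy "KV cache" state
    cached_frames: int = 0  # number of past frames folded into the cache
    nfe: int = 0            # count of function evaluations performed

    def step(self, prev: Latent, control: float, noise: Latent) -> Latent:
        # Fold only the newest input into the cached state; earlier frames
        # are never reprocessed, which is the role a KV cache plays.
        if not self.running_sum:
            self.running_sum = [0.0] * len(prev)
        self.running_sum = [s + p for s, p in zip(self.running_sum, prev)]
        self.cached_frames += 1
        self.nfe += 1  # exactly one evaluation per generated latent frame
        context = [s / self.cached_frames for s in self.running_sum]
        # Arbitrary toy arithmetic in place of a neural network.
        return [0.5 * c + 0.1 * n + control for c, n in zip(context, noise)]


def stream(gen: ToyOneStepGenerator, first: Latent,
           get_control: Callable[[int], float],
           num_frames: int, seed: int = 0) -> Iterator[Latent]:
    """Yield latent frames one at a time, feeding each output back as input."""
    rng = random.Random(seed)
    prev = first
    for t in range(num_frames):
        noise = [rng.gauss(0.0, 1.0) for _ in first]
        frame = gen.step(prev, get_control(t), noise)
        yield frame   # a real system would decode and display this frame here
        prev = frame  # the next step conditions on the model's own output


if __name__ == "__main__":
    gen = ToyOneStepGenerator()
    # Hypothetical control signal: the user "presses a key" from frame 3 on.
    for t, frame in enumerate(stream(gen, [0.0, 0.0],
                                     lambda t: 1.0 if t >= 3 else 0.0, 6)):
        print(t, [round(x, 3) for x in frame])
    print("function evaluations:", gen.nfe)
```

Because `stream` is a generator, each frame is available to the caller before the next one is computed, which is what lets a control signal arrive between frames. For scale, the one-minute result quoted above follows from 1440 frames / 24 fps = 60 s.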