LongLive: Real-time Interactive Long Video Generation

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video generation faces three core challenges: diffusion-based models produce high quality but infer slowly due to bidirectional attention; causal autoregressive models support KV caching for fast inference but degrade on long videos because of training memory constraints; and streaming prompt interaction further strains visual and semantic coherence during prompt transitions. This paper proposes LongLive, a causal frame-level autoregressive framework with three key designs: (1) a KV-recache mechanism that refreshes cached states with each new prompt for smooth, prompt-adherent switches; (2) streaming long tuning, which enables long-video training and aligns training with inference (train-long-test-long); and (3) short window attention paired with a frame-level attention sink (frame sink), which preserves long-range consistency while speeding up generation. Fine-tuned from a 1.3B-parameter short-clip model in just 32 GPU-days, LongLive sustains 20.7 FPS on a single NVIDIA H100, supports videos up to 240 seconds, achieves strong results on VBench for both short and long videos, and supports INT8-quantized inference with only marginal quality loss.
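The KV-recache idea can be sketched in toy form: on a prompt switch, rather than clearing the KV cache (losing visual history) or keeping stale entries (ignoring the new prompt), the cached frames' key/value pairs are rebuilt from the same frame states under the new prompt. The additive prompt conditioning, dimensions, and random weights below are illustrative assumptions, not the paper's actual operator.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size

# Toy K/V projection weights; in the real model these are learned
# attention projections.
W_k, W_v = rng.standard_normal((D, D)), rng.standard_normal((D, D))

def build_kv(frame_hidden, prompt_emb):
    """Recompute K/V for every cached frame under a given prompt.
    The additive conditioning is a placeholder assumption."""
    h = frame_hidden + prompt_emb          # prompt-conditioned frame states
    return h @ W_k, h @ W_v

# Normal decoding: hidden states of already generated frames stay cached.
frame_hidden = rng.standard_normal((4, D))   # 4 generated frames
prompt_a = rng.standard_normal(D)
K, V = build_kv(frame_hidden, prompt_a)

# Prompt switch: KV-recache rebuilds K/V from the *same* frame states
# under the new prompt, so history is kept but now reflects prompt B.
prompt_b = rng.standard_normal(D)
K2, V2 = build_kv(frame_hidden, prompt_b)

assert K.shape == K2.shape == (4, D)   # history length preserved
assert not np.allclose(K, K2)          # keys now reflect the new prompt
```

The design point the sketch captures: the visual context (frame states) survives the switch, while the attention targets are refreshed for prompt adherence.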

📝 Abstract
We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal-attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities such as streaming prompt inputs are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates: a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning, which enables long-video training and aligns training with inference (train-long-test-long); and short window attention paired with a frame-level attention sink (frame sink for short), which preserves long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for both short and long videos. It supports videos up to 240 seconds on a single H100 GPU, as well as INT8-quantized inference with only marginal quality loss.
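The short-window-attention-plus-frame-sink design described in the abstract can be illustrated with a toy causal attention mask: each frame attends to a small recent window plus the earliest "sink" frames, which anchor long-range consistency while keeping attention cost roughly constant. The window and sink sizes below are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def frame_sink_mask(T, window, sink):
    """Boolean mask over T frames: query frame q may attend to the last
    `window` frames (short causal window) plus the first `sink` frames
    (the frame-level attention sink). Toy sketch, not the paper's code."""
    mask = np.zeros((T, T), dtype=bool)
    for q in range(T):
        lo = max(0, q - window + 1)
        mask[q, lo:q + 1] = True   # short causal window
        mask[q, :sink] = True      # always keep the earliest frames
        mask[q, q + 1:] = False    # enforce causality over the sink
    return mask

m = frame_sink_mask(T=10, window=3, sink=2)
assert m[9, 9] and m[9, 8] and m[9, 7]   # recent window attended
assert m[9, 0] and m[9, 1]               # sink frames always kept
assert not m[9, 4]                       # distant non-sink frame dropped
assert not m[0, 1]                       # no attention to the future
```

Per-frame attention cost is then O(window + sink) instead of O(T), which is what makes faster generation compatible with a persistent long-range anchor.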
Problem

Research questions and friction points this paper is trying to address.

Generating long videos in real time while maintaining both efficiency and quality
Bounding training memory so causal autoregressive models can be tuned on long videos
Preserving visual consistency and semantic coherence across streaming prompt switches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frame-level autoregressive design with KV-recache mechanism
Streaming long tuning for long video training
Short window attention paired with frame sink
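The streaming long tuning item above can be sketched as a bounded-memory training loop: the long video is consumed chunk by chunk, with earlier context carried only through a cache that is cut off between chunks, mirroring how inference streams frames (train-long-test-long). The `ToyARModel`, its `train_step` API, and the update rule are placeholder assumptions, not the paper's training objective.

```python
class ToyARModel:
    """Stand-in model: 'training' just measures how far each chunk drifts
    from the carried context and updates that context. Purely illustrative."""
    def train_step(self, chunk, cache):
        ctx = cache if cache is not None else 0.0
        loss = sum(abs(x - ctx) for x in chunk) / len(chunk)
        new_cache = (ctx + sum(chunk) / len(chunk)) / 2  # carried context
        return loss, new_cache

def streaming_long_tuning(model, frames, chunk_len):
    """Supervise one short chunk at a time; history reaches the next chunk
    only through the cache, so peak memory scales with chunk_len rather
    than with the full video length."""
    cache, losses = None, []
    for s in range(0, len(frames), chunk_len):
        loss, cache = model.train_step(frames[s:s + chunk_len], cache)
        # In a real setup the cache would be gradient-detached here so
        # backprop never spans more than one chunk.
        losses.append(loss)
    return losses

losses = streaming_long_tuning(ToyARModel(), list(range(12)), chunk_len=4)
assert len(losses) == 3   # 12 frames processed as three chunks of 4
```

The key alignment property: the model is trained under exactly the cached-context regime it will see when streaming frames at inference time.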