StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Image-based streaming diffusion models suffer from poor temporal consistency; offline video diffusion systems fail to meet the low-latency, low-jitter requirements of live streaming; and scalable multi-GPU serving for dynamic, interactive video generation remains unresolved. To address these challenges, this paper proposes the first SLO-aware streaming system tailored to dynamic, interactive video generation. The approach integrates a rolling KV cache, sink-token-guided training-free streaming inference, motion-aware noise control, and parallel orchestration across denoising steps and network layers, complemented by a lightweight block scheduler that enables heterogeneous multi-GPU pipelining. On four H100 GPUs, the system achieves 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model, with first-frame latency under 0.5 s and support for flexible 1–4-step denoising, marking the first realization of high-quality, scalable streaming video diffusion generation under strict latency constraints.
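The sink-token-guided rolling KV cache described above can be sketched as follows. This is a minimal illustration of the general idea (pin a few early "sink" entries permanently while the rest of the cache rolls over a fixed window, bounding memory for an unbounded stream); the class name, structure, and eviction policy are assumptions, not the paper's implementation.

```python
from collections import deque


class RollingKVCache:
    """Hypothetical sketch of a sink-token-guided rolling KV cache.

    The first `num_sink` entries are pinned and never evicted, while
    the remaining entries roll over a fixed-size window, so memory
    stays bounded no matter how long the stream runs.
    """

    def __init__(self, num_sink: int, window: int):
        self.num_sink = num_sink
        self.sink: list = []                        # pinned sink entries
        self.rolling: deque = deque(maxlen=window)  # FIFO rolling window

    def append(self, kv) -> None:
        if len(self.sink) < self.num_sink:
            self.sink.append(kv)     # fill sink slots first
        else:
            self.rolling.append(kv)  # oldest window entry auto-evicted

    def view(self) -> list:
        """Cache contents attended to at the current step."""
        return self.sink + list(self.rolling)


cache = RollingKVCache(num_sink=2, window=3)
for t in range(7):
    cache.append(t)
print(cache.view())  # sinks 0,1 kept; window holds the 3 newest entries
```

With seven entries appended, the cache retains the two sink entries plus the three most recent, so attention cost per step stays constant regardless of stream length.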

📝 Abstract
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live-streaming products, but their image-based designs limit temporal consistency. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. In addition, scalable multi-GPU serving for real-time streams remains largely unresolved. To address this, we present StreamDiffusionV2, a training-free pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token-guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1–4), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5 s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs, making state-of-the-art generative live streaming practical and accessible, from individual creators to enterprise-scale platforms.
Problem

Research questions and friction points this paper is trying to address.

Achieving temporal consistency in live video streaming with diffusion models
Meeting strict latency requirements for real-time interactive video generation
Scaling video diffusion models efficiently across multiple GPUs for streaming
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free pipeline for interactive video diffusion streaming
SLO-aware batching scheduler with rolling KV cache
Scalable pipeline orchestration across denoising steps and layers
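The pipeline-orchestration idea above, splitting the diffusion process across denoising steps and network layers so several video blocks are in flight at once, can be sketched with a toy stage mapping and a classic pipeline timing model. The round-robin assignment and tick model here are simplifying assumptions for illustration, not the paper's actual scheduler.

```python
def orchestrate(num_steps: int, num_shards: int, num_gpus: int):
    """Map each (denoising step, layer shard) stage round-robin onto
    GPUs so consecutive video blocks can be processed concurrently.
    Illustrative assumption, not the paper's scheduling policy."""
    stages = [(step, shard)
              for step in range(num_steps)
              for shard in range(num_shards)]
    assignment = {stage: i % num_gpus for i, stage in enumerate(stages)}
    return stages, assignment


def pipeline_ticks(num_stages: int, num_blocks: int) -> int:
    """Classic pipeline fill/drain timing: the first block takes
    num_stages ticks; each later block completes one tick after the
    previous one."""
    return num_stages + num_blocks - 1


# Example: 2 denoising steps x 2 layer shards spread over 4 GPUs
stages, assignment = orchestrate(num_steps=2, num_shards=2, num_gpus=4)
print(assignment)                 # each stage lands on its own GPU
print(pipeline_ticks(4, 10))      # 13 ticks vs 40 if run sequentially
```

Once the pipeline is full, it emits roughly one block per tick regardless of how many stages a block passes through, which is the intuition behind the near-linear FPS scaling the abstract claims.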