StreamDiT: Real-Time Streaming Text-to-Video Generation

πŸ“… 2025-07-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing text-to-video (T2V) models are constrained by offline, short-clip generation paradigms, which hinders interactive and real-time applications. This paper proposes StreamDiT, a streaming video generation architecture that integrates dynamic buffer management, windowed attention, and multi-step segment-wise distillation to improve visual fidelity and inference efficiency while preserving inter-frame consistency. Built on an adaLN-modulated DiT backbone, StreamDiT combines flow matching, time-aware embedding modulation, and hybrid training to substantially reduce computational overhead. The distilled 4-billion-parameter model achieves end-to-end real-time video streaming at 512p resolution and 16 FPS on a single GPU, which the authors present as the first T2V system capable of high-resolution continuous video streaming. It outperforms prior methods in both quality and latency, establishing a new benchmark for practical, interactive T2V deployment.
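The flow-matching objective mentioned above trains the model to regress a velocity field along a straight path between noise and data. A minimal sketch of the standard rectified-flow interpolation is shown below; the function name and the convention that `x0` is noise and `x1` is data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Hypothetical sketch of the rectified-flow interpolation commonly
    used in flow matching. x0 is a noise sample, x1 a data sample, and
    t in [0, 1] a time on the straight path between them."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the linear interpolation path
    v_target = x1 - x0              # constant velocity target the model regresses
    return x_t, v_target
```

The training loss would then be the squared error between the model's predicted velocity at `(x_t, t)` and `v_target`; StreamDiT additionally varies `t` per frame chunk inside its moving buffer.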

πŸ“ Abstract
Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video. We provide video results and more examples in our project website: <a href="https://cumulo-autumn.github.io/StreamDiT/">this https URL.</a>
Problem

Research questions and friction points this paper is trying to address.

Real-time streaming text-to-video generation challenge
Overcoming offline short clip limitations in T2V models
Achieving high-quality video generation with low latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

StreamDiT uses flow matching with moving buffer
Mixed training boosts consistency and quality
Multistep distillation reduces function evaluations
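The moving-buffer idea in the bullets above can be simulated schematically: chunks of frames sit in one buffer at staggered noise levels, each model call denoises every chunk by one step, the cleanest chunk is emitted as output, and fresh noise is appended. The sketch below is a toy scheduling simulation under these assumptions, not the StreamDiT implementation; all names and the specific noise schedule are hypothetical.

```python
from collections import deque

def stream_schedule(num_chunks_out, buffer_size=4, steps_per_chunk=2):
    """Toy simulation of a moving denoising buffer (illustrative only).
    Chunks enter fully noisy, march toward zero noise as the buffer
    slides, and are emitted once clean. Returns the emitted chunk ids
    and the total number of function evaluations (NFEs) used."""
    # Each entry is (chunk_id, remaining_denoising_steps); index 0 is cleanest.
    buf = deque((i, (i + 1) * steps_per_chunk) for i in range(buffer_size))
    next_id = buffer_size
    emitted, total_nfe = [], 0
    while len(emitted) < num_chunks_out:
        total_nfe += 1                 # one model call denoises *all* chunks at once
        for i, (cid, t) in enumerate(buf):
            buf[i] = (cid, t - 1)      # every chunk advances one denoising step
        while buf and buf[0][1] == 0:  # cleanest chunk is fully denoised
            emitted.append(buf.popleft()[0])
            buf.append((next_id, buffer_size * steps_per_chunk))
            next_id += 1
    return emitted, total_nfe
```

In steady state this toy schedule emits one chunk every `steps_per_chunk` model calls, which mirrors the abstract's point that distillation cuts the per-chunk NFE count down to a small constant tied to the buffer partitioning.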
πŸ”Ž Similar Papers
No similar papers found.