🤖 AI Summary
High-resolution video generation is hindered by the quadratic computational complexity of diffusion models, resulting in prohibitively slow inference. To address this, we propose the first streaming autoregressive framework tailored for high-resolution video generation, overcoming this bottleneck with a three-fold redundancy-elimination strategy: (1) spatial compression through low-resolution denoising followed by cached-feature upsampling; (2) efficient temporal modeling via block-wise processing with a fixed anchor; and (3) sparse step scheduling guided by conditional caching to reduce sampling cost. Our method integrates spatiotemporal joint compression, progressive refinement, and feature reuse. On 1080p video generation, it achieves a 76.2× speedup over Wan2.1 with the main model (107.5× with HiStream+), while attaining state-of-the-art visual quality with negligible perceptual degradation.
📝 Abstract
High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making inference impractically slow. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy along three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality with negligible quality loss while denoising up to 76.2× faster than the Wan2.1 baseline. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii) and achieves a 107.5× acceleration over the baseline, offering a compelling speed–quality trade-off. Together, these results make high-resolution video generation both practical and scalable.
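To make the interplay of the three compression axes concrete, the toy sketch below mocks the control flow of such a pipeline in NumPy. It is not the authors' implementation: `denoise`, `upsample`, the anchor-blending rule, and all sizes and step counts are hypothetical placeholders chosen only to illustrate (i) low-resolution denoising with high-resolution refinement, (ii) a fixed-size anchor cache carried across chunks, and (iii) fewer denoising steps for cache-conditioned chunks.

```python
import numpy as np

def denoise(x, steps):
    # Placeholder "denoiser": each step merely damps the signal.
    # A real model would run a diffusion network per step.
    for _ in range(steps):
        x = 0.9 * x
    return x

def upsample(x, factor=4):
    # Nearest-neighbor spatial upsampling over the H and W axes.
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def generate(num_chunks=4, chunk_frames=8, lowres=(32, 32),
             anchor_size=2, full_steps=20, fast_steps=6):
    """Chunk-by-chunk generation with a fixed-size anchor cache."""
    video = []
    anchor_cache = None  # stays anchor_size frames, so cost per chunk is stable
    for i in range(num_chunks):
        # (i) Spatial compression: denoise at low resolution first.
        x = np.random.randn(chunk_frames, *lowres)
        # (iii) Timestep compression: later, cache-conditioned chunks
        # get fewer denoising steps than the first chunk.
        steps = full_steps if i == 0 else fast_steps
        x = denoise(x, steps)
        # Refine at high resolution, reusing cached anchor features.
        hi = upsample(x)
        if anchor_cache is not None:
            hi[:anchor_size] = 0.5 * hi[:anchor_size] + 0.5 * anchor_cache
        hi = denoise(hi, 2)
        # (ii) Temporal compression: keep only the last anchor_size frames
        # as conditioning for the next chunk, never the full history.
        anchor_cache = hi[-anchor_size:].copy()
        video.append(hi)
    return np.concatenate(video, axis=0)
```

Because the cache never grows beyond `anchor_size` frames, per-chunk cost is constant in video length, which is the property the abstract's "stable inference speed" claim refers to.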