Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks

📅 2025-03-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the computational cost of long-video generation and the consistency problems of chunk-based methods, this paper proposes Video Interface Networks (VINs), an abstraction module that augments diffusion Transformers (DiTs) so that video chunks can be denoised in parallel. At each diffusion step, a VIN encodes global semantics from the noisy chunks into fixed-size encoding tokens via a single cross-attention step, and these tokens guide the DiT as it denoises all chunks; the two are trained end-to-end on the denoising objective. Because the encoding tokens are decoupled from the input length, the approach scales to long videos. On VBench, VINs outperform existing chunk-based methods in background consistency and subject coherence. An optical flow analysis shows state-of-the-art motion smoothness with 25–40% fewer FLOPs than full generation, and human raters in a user study assessed overall video quality and temporal consistency favorably.

📝 Abstract

Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy input of local chunks and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality and temporal consistency of our method in a user study.
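The fixed-size encoding idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a Perceiver-style read in which a fixed set of tokens acts as queries over all input tokens, and it omits the learned query, key, and value projections. The names `encode_with_fixed_tokens`, `latents`, `short_video`, and `long_video`, and all sizes, are hypothetical. The point it shows is that the encoded output keeps the same shape no matter how long the input is.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def encode_with_fixed_tokens(latents, inputs):
    """One cross-attention step: fixed tokens (M, d) read from inputs (N, d).

    Assumption: learned projections are omitted for brevity, so the
    tokens serve directly as queries and the inputs as keys and values.
    """
    dim = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(dim)  # (M, N) attention scores
    weights = softmax(scores, axis=-1)          # each token's weights sum to 1
    return weights @ inputs                     # (M, d), independent of N


rng = np.random.default_rng(0)
num_tokens, dim = 8, 16                         # hypothetical sizes
latents = rng.normal(size=(num_tokens, dim))

short_video = rng.normal(size=(4 * 32, dim))    # 4 chunks of 32 tokens
long_video = rng.normal(size=(16 * 32, dim))    # 16 chunks of 32 tokens

short_code = encode_with_fixed_tokens(latents, short_video)
long_code = encode_with_fixed_tokens(latents, long_video)
print(short_code.shape, long_code.shape)
```

In this sketch the cost of the read grows linearly with the number of input tokens (the score matrix is M by N with M fixed), whereas full self-attention over the same input would form an N by N matrix.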

Problem

Research questions and friction points this paper is trying to address.

Overcoming computational challenges in long video generation

Enhancing video consistency and coherence in parallel generation

Reducing FLOPs while maintaining high-quality motion smoothness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel video chunk inference with VINs

Fixed-size encoding tokens for scalability

End-to-end training of VIN and DiT on the denoising objective
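The control flow behind these points, as the abstract describes it, can be sketched as a loop that re-encodes global context at every diffusion step and then updates all chunks in one batched call. Everything inside the two functions is a toy placeholder chosen only so the example runs: `encode_global` stands in for the VIN and uses mean pooling, `denoise_chunks` stands in for the DiT and uses a simple pull toward the shared context, and the sizes and step count are hypothetical. It is not a real diffusion sampler.

```python
import numpy as np


def encode_global(noisy_chunks, num_tokens=4):
    """Placeholder for the VIN: pool every chunk into a fixed number of tokens.

    Assumption: mean pooling replaces the learned cross-attention encoder.
    """
    tokens = noisy_chunks.reshape(-1, noisy_chunks.shape[-1])  # (chunks * length, d)
    groups = np.array_split(tokens, num_tokens, axis=0)
    return np.stack([group.mean(axis=0) for group in groups])   # (num_tokens, d)


def denoise_chunks(noisy_chunks, global_tokens, step_size=0.1):
    """Placeholder for the DiT: one batched update over the chunk axis.

    Assumption: a toy update that nudges each chunk toward the shared
    global context; a real model would predict noise instead.
    """
    context = global_tokens.mean(axis=0)                        # (d,)
    return noisy_chunks + step_size * (context - noisy_chunks)


rng = np.random.default_rng(0)
chunks = rng.normal(size=(6, 32, 16))  # 6 chunks, 32 tokens each, dimension 16

for step in range(10):
    # Global semantics are re-encoded from the current noisy chunks at each step.
    global_tokens = encode_global(chunks)
    # All chunks are updated together rather than one after another.
    chunks = denoise_chunks(chunks, global_tokens)

print(global_tokens.shape, chunks.shape)
```

The design point the sketch isolates is that no chunk waits on another: the only shared state is the fixed-size set of global tokens, so the per-step work over chunks can be batched.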

🔎 Similar Papers

Real-Time Video Generation with Pyramid Attention Broadcast