VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate

📅 2025-04-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost and fixed frame-rate limitation of DiT-based video diffusion models, this paper proposes a training-agnostic dynamic frame-rate scheduling framework. Methodologically, it introduces the first latent-space frame sampling strategy that adaptively adjusts frame density according to video motion characteristics, integrated with redundant-frame merging and hierarchical RoPE reconfiguration to jointly optimize semantic coherence and detail fidelity. Key technical components include dynamic token sampling, latent-frame merging, and hierarchical Rotary Positional Embedding preference analysis. Experiments demonstrate that the method achieves up to 3× inference speedup while preserving near-lossless video quality—evidenced by <1.5% FVD degradation and no perceptible visual degradation in human evaluation—outperforming existing acceleration approaches. This work establishes a plug-and-play paradigm for efficient diffusion-based video generation.

Technology Category

Application Category

📝 Abstract
Diffusion Transformer(DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Inspired by this temporal non-uniformity, we propose VGDFR, a training-free approach for Diffusion-based Video Generation with Dynamic Latent Frame Rate. VGDFR adaptively adjusts the number of elements in latent space based on the motion frequency of the latent space content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: (1) A dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates for video segments. (2) A novel latent-space frame merging method to align latent representations with their denoised counterparts before merging those redundant in low-resolution space. (3) A preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for semantic and local information capture. Experiments show that VGDFR can achieve a speedup up to 3x for video generation with minimal quality degradation.
Problem

Research questions and friction points this paper is trying to address.

Efficiency challenges in Diffusion Transformer video generation
Dynamic information density in real-world videos
Adaptive frame rate adjustment for video segments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic frame rate scheduler for DiT
Latent-space frame merging method
Tailored RoPE strategy optimization
🔎 Similar Papers
No similar papers found.