🤖 AI Summary
To address the low inference efficiency and poor long-video generalization of pre-trained video diffusion models, this paper proposes an efficient post-training acceleration framework. The method introduces a deeply compressed video autoencoder with a block-wise causal temporal design, together with AE-Adapt-V, a latent-space adaptation strategy that enables stable knowledge transfer under lightweight fine-tuning. The approach preserves high-fidelity video reconstruction while substantially improving both long-video generation capability and inference speed. With only 10 H100 GPU-days of fine-tuning, inference latency is reduced by up to 14.8×, enabling ultra-high-definition (2160×3840) video generation on a single GPU. The core innovation lies in the synergistic design of deep latent compression and block-wise causal modeling, which jointly optimizes inference efficiency, visual quality, and generalization across video lengths.
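As a rough illustration of the block-wise causal idea, the sketch below builds a temporal attention mask in which frames within a chunk attend to each other bidirectionally, while later chunks can only attend back to earlier ones. The helper name, chunk size, and PyTorch framing are our own assumptions for illustration, not details taken from the paper.

```python
import torch

def chunk_causal_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Hypothetical chunk-causal attention mask: frames within a chunk
    attend to each other bidirectionally; across chunks, attention is
    causal (a frame only sees its own chunk and earlier chunks)."""
    chunk_id = torch.arange(num_frames) // chunk_size  # chunk index per frame
    # A query frame (row) may attend to a key frame (col) iff the key's
    # chunk is the same or earlier.
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)

# 8 frames in chunks of 4: two 4x4 blocks of full attention on the
# diagonal, with the second chunk also attending back to the first.
print(chunk_causal_mask(8, 4).int())
```

A mask of this shape gives each chunk full local context for reconstruction while keeping temporal dependencies causal across chunks, the property the abstract credits for generalization to longer videos.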
📝 Abstract
We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.
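For intuition about where the speedup comes from, a back-of-the-envelope token count is sketched below. The baseline 8x spatial / 4x temporal autoencoder ratios, the 2x2 transformer patch size, and the first-frame temporal handling are assumptions typical of Wan-style diffusion transformers, not figures from the paper.

```python
# Hypothetical helper: number of latent tokens a video diffusion
# transformer processes per denoising step.
def latent_tokens(h, w, t, s_ratio, t_ratio, patch=2):
    lat_h, lat_w = h // s_ratio, w // s_ratio   # spatial compression
    lat_t = (t - 1) // t_ratio + 1              # temporal compression (first frame kept)
    return (lat_h // patch) * (lat_w // patch) * lat_t

# 2160x3840, 81-frame clip: assumed 8x/4x baseline VAE vs. 32x/4x deep compression.
base = latent_tokens(2160, 3840, 81, s_ratio=8,  t_ratio=4)
deep = latent_tokens(2160, 3840, 81, s_ratio=32, t_ratio=4)
print(f"{base} -> {deep} tokens, ~{base / deep:.1f}x fewer")
```

Moving from 8x to 32x spatial compression quarters each latent dimension, so the transformer handles roughly 16x fewer tokens per step; that reduction is broadly consistent with the reported up-to-14.8x latency improvement once fixed overheads are accounted for.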