🤖 AI Summary
To address the low inference efficiency and poor long-video generalization of pre-trained video diffusion models, this paper proposes an efficient post-training acceleration framework. The method introduces a deeply compressed video autoencoder with a block-wise causal temporal design, together with AE-Adapt-V, a latent-space adaptation strategy that enables stable knowledge transfer under lightweight fine-tuning. The approach preserves high-fidelity video reconstruction while substantially improving both long-video generation capability and inference speed. With only 10 H100 GPU-days of fine-tuning, inference latency is reduced by up to 14.8×, enabling ultra-high-definition (2160×3840) video generation on a single GPU. The core innovation lies in the synergistic design of deep latent compression and block-wise causal modeling, which jointly optimizes inference efficiency, visual quality, and generalization across video lengths.
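As a rough illustration of the block-wise causal idea, the sketch below builds a temporal attention mask in which frames within a chunk attend to each other bidirectionally, while later chunks can only attend back to earlier ones. The helper name, chunk size, and PyTorch framing are our own assumptions for illustration, not details taken from the paper.

```python
import torch

def chunk_causal_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Hypothetical chunk-causal attention mask: frames within a chunk
    attend to each other bidirectionally; across chunks, attention is
    causal (a frame only sees its own chunk and earlier chunks)."""
    chunk_id = torch.arange(num_frames) // chunk_size  # chunk index per frame
    # A query frame (row) may attend to a key frame (col) iff the key's
    # chunk is the same or earlier.
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)

# 8 frames in chunks of 4: two 4x4 blocks of full attention on the
# diagonal, with the second chunk also attending back to the first.
print(chunk_causal_mask(8, 4).int())
```

A mask of this shape gives each chunk full local context for reconstruction while keeping temporal dependencies causal across chunks, the property the abstract credits for generalization to longer videos.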
📝 Abstract
We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.
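For intuition about where the speedup comes from, a back-of-the-envelope token count is sketched below. The baseline 8x spatial / 4x temporal autoencoder ratios, the 2x2 transformer patch size, and the first-frame temporal handling are assumptions typical of Wan-style diffusion transformers, not figures from the paper.

```python
# Hypothetical helper: number of latent tokens a video diffusion
# transformer processes per denoising step.
def latent_tokens(h, w, t, s_ratio, t_ratio, patch=2):
    lat_h, lat_w = h // s_ratio, w // s_ratio   # spatial compression
    lat_t = (t - 1) // t_ratio + 1              # temporal compression (first frame kept)
    return (lat_h // patch) * (lat_w // patch) * lat_t

# 2160x3840, 81-frame clip: assumed 8x/4x baseline VAE vs. 32x/4x deep compression.
base = latent_tokens(2160, 3840, 81, s_ratio=8,  t_ratio=4)
deep = latent_tokens(2160, 3840, 81, s_ratio=32, t_ratio=4)
print(f"{base} -> {deep} tokens, ~{base / deep:.1f}x fewer")
```

Moving from 8x to 32x spatial compression quarters each latent dimension, so the transformer handles roughly 16x fewer tokens per step; that reduction is broadly consistent with the reported up-to-14.8x latency improvement once fixed overheads are accounted for.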