🤖 AI Summary
This work addresses the high inference latency of VAE decoders, which has become a critical bottleneck in latent diffusion models for video generation. The authors propose a general, plug-and-play VAE acceleration framework that strictly preserves the original latent-space distribution while significantly improving efficiency. By combining independence-aware channel pruning, stage-wise dominant-operator optimization (including an improved causal 3D convolution), and a three-phase dynamic knowledge-distillation strategy, the method effectively transfers the capabilities of the original model. Evaluated on the Wan and LTX-Video VAE decoders, the approach achieves approximately a 6× speedup in VAE decoding while retaining 96.9% of the original reconstruction performance, yielding up to a 36% end-to-end speedup in video generation with negligible quality degradation.
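To make "independence-aware channel pruning" concrete, here is a hypothetical, dependency-free sketch: a channel is kept only if it is sufficiently independent of the channels already kept, with independence approximated here by absolute Pearson correlation of channel activations. This is an illustration of the general idea, not the paper's actual criterion or implementation; the names `pearson` and `prune_channels` are invented for this sketch.

```python
# Hedged sketch of independence-aware channel pruning: greedily keep a
# channel only if its activations are weakly correlated with every
# already-kept channel. Highly correlated (redundant) channels are pruned.

def pearson(a, b):
    """Pearson correlation of two equal-length activation vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def prune_channels(activations, threshold=0.95):
    """activations: one flattened activation vector per channel.
    Returns indices of channels kept, in first-come greedy order."""
    kept = []
    for i, ch in enumerate(activations):
        if all(abs(pearson(ch, activations[j])) < threshold for j in kept):
            kept.append(i)
    return kept

chans = [
    [1.0, 2.0, 3.0, 4.0],  # channel 0
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with channel 0 -> pruned
    [4.0, 1.0, 3.0, 2.0],  # weakly correlated with channel 0 -> kept
]
print(prune_channels(chans))  # -> [0, 2]
```

In practice such a criterion would be evaluated on activation statistics gathered over a calibration set, and the pruned decoder then recovered via the distillation stage the summary describes.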
📝 Abstract
Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we introduce (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6$\times$ speedup while retaining up to 96.9% of the original reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
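The causal 3D convolutions the abstract identifies as the dominant operator differ from ordinary 3D convolutions only in how the temporal axis is padded. Below is a minimal sketch of that temporal behavior (spatial dimensions omitted for clarity, and `causal_conv1d` is a name invented here): with a temporal kernel of size k, all k-1 padding frames go in front, so the output at frame t never depends on frames later than t.

```python
# Sketch of the temporal part of a causal 3D convolution: pad only the
# "past" side of the time axis so each output frame depends solely on
# the current and earlier frames.

def causal_conv1d(frames, kernel):
    """frames: one value per video frame (spatial dims omitted);
    kernel: k weights, kernel[-1] applied to the current frame."""
    k = len(kernel)
    # Replicate-pad k-1 frames at the front only; no future frames leak in.
    padded = [frames[0]] * (k - 1) + frames
    return [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(frames))
    ]

# With an identity kernel only the current frame contributes, confirming
# that no future frame influences any output position.
print(causal_conv1d([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0]))
# -> [1.0, 2.0, 3.0, 4.0]
```

This one-sided padding is what lets video VAEs decode frames streamingly, but it also makes the operator expensive at high resolution, which is what the paper's stage-wise dominant operator optimization targets.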