LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and inefficiency of Video VAEs in encoding high-resolution videos for Latent Video Diffusion Models (LVDMs), this work proposes a highly efficient video autoencoder framework. Methodologically, it introduces a lightweight Neighborhood-Aware Feedforward (NAF) module, adopts a non-overlapping block-wise encoding strategy, and jointly leverages the discrete wavelet transform and compressed sensing for latent-space reconstruction, balancing efficiency and fidelity. Experiments demonstrate that the framework reduces computational cost by up to 50× and accelerates inference by up to 44× over state-of-the-art Video VAEs while maintaining competitive PSNR and SSIM. The core contribution lies in the synergistic design of neighborhood-aware feedforward modeling and multi-scale sparse reconstruction, establishing a scalable latent-representation foundation for large-scale video diffusion models.
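The non-overlapping block-wise encoding mentioned above amounts to a space-to-depth rearrangement: the video is cut into disjoint spatio-temporal patches and each patch is flattened into the channel dimension. The sketch below is a generic illustration of that operation; the function name and the patch sizes (2×4×4) are assumptions for the example, not the paper's actual configuration.

```python
import numpy as np

def patchify_video(video, pt=2, ph=4, pw=4):
    """Rearrange a (T, H, W, C) video into non-overlapping spatio-temporal
    patches, flattening each pt x ph x pw x C patch into the channel axis.
    Output shape: (T//pt, H//ph, W//pw, pt*ph*pw*C)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Group the patch-internal axes (pt, ph, pw, C) together at the end.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(T // pt, H // ph, W // pw, pt * ph * pw * C)

video = np.arange(8 * 16 * 16 * 3, dtype=np.float32).reshape(8, 16, 16, 3)
tokens = patchify_video(video)
print(tokens.shape)  # (4, 4, 4, 96)
```

Because patches do not overlap, the rearrangement is lossless and invertible, and the cost grows linearly with the number of pixels rather than with a sliding-window factor.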

📝 Abstract
Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at https://github.com/westlake-repl/LeanVAE.
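The wavelet transforms mentioned in the abstract decompose each frame into multi-scale subbands, separating coarse structure from fine detail before encoding. A one-level 2-D Haar transform, the simplest DWT, can be sketched as below; this is a generic textbook construction, not LeanVAE's actual wavelet pipeline.

```python
import numpy as np

def haar_dwt2(frame):
    """One-level 2-D orthonormal Haar transform of a (H, W) frame.
    Returns the four subbands (LL, LH, HL, HH), each (H//2, W//2):
    LL = coarse approximation; LH/HL/HW capture horizontal/vertical/
    diagonal detail."""
    a = frame[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = frame[0::2, 1::2]  # top-right
    c = frame[1::2, 0::2]  # bottom-left
    d = frame[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a + b - c - d) / 2.0
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

frame = np.arange(16.0).reshape(4, 4)
ll, lh, hl, hh = haar_dwt2(frame)
print(ll.shape)  # (2, 2)
```

The transform is orthonormal, so it preserves signal energy and is exactly invertible, which is why it can feed a reconstruction pipeline without losing information.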
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost in video encoding
Enhances video reconstruction quality efficiently
Improves scalability of video generation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight architecture with NAF module
Wavelet transforms enhance reconstruction quality
Compressed sensing reduces computational overhead
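The compressed-sensing idea in the last bullet is to measure a signal through a short random projection and recover it afterwards, so the encoder only has to carry a small measurement vector. The sketch below shows the generic mechanism with a random Gaussian measurement matrix and a minimum-norm recovery; real compressed-sensing recovery exploits sparsity priors (or, as here presumably, a learned decoder), and all dimensions and names in this example are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 32                                   # signal dim, measurement dim (m < n)
phi = rng.standard_normal((m, n)) / np.sqrt(m)  # random measurement matrix

# A sparse test signal: only 3 of 64 entries are nonzero.
x = np.zeros(n)
x[[3, 17, 42]] = [1.0, -2.0, 0.5]

# Compression: project the signal down to m measurements.
y = phi @ x

# Minimum-norm recovery consistent with the measurements
# (a stand-in for the sparsity-aware or learned decoder).
x_hat = phi.T @ np.linalg.solve(phi @ phi.T, y)
print(y.shape)  # (32,)
```

The point of the design is that `y` is half the size of `x`, yet the measurements remain consistent (`phi @ x_hat == y`), so a decoder with a good prior can reconstruct the full signal from the compressed latent.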