VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the slow convergence of diffusion model training, a challenge exacerbated by existing acceleration methods that rely on external encoders or dual-model architectures, incurring substantial computational overhead. To overcome this, the authors propose a lightweight, endogenous guidance framework that leverages the intrinsic visual priors of a pretrained VAE. By introducing a lightweight projection layer, the method aligns the VAE’s reconstruction features with intermediate representations in a diffusion Transformer, guided by a dedicated feature alignment loss. Notably, this approach requires no additional models and incurs only a 4% increase in GFLOPs. Experiments across multiple benchmarks demonstrate significant improvements in both training convergence speed and generation quality, matching or surpassing state-of-the-art acceleration techniques while maintaining simplicity, generality, and computational efficiency.

📝 Abstract
Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors such as rich texture details, structural patterns, and basic semantic information. Specifically, VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.
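The alignment mechanism the abstract describes — projecting intermediate diffusion-transformer features through a lightweight layer and supervising them against VAE features — can be sketched as below. This is a minimal illustration, not the paper's implementation: the linear projection, the per-token cosine-similarity form of the loss, and all tensor shapes are assumptions.

```python
import numpy as np

def alignment_loss(dit_feats, vae_feats, W, b):
    """Hypothetical feature-alignment loss.

    dit_feats: (N, T, D_dit) intermediate diffusion-transformer tokens
    vae_feats: (N, T, D_vae) VAE reconstruction features (the targets)
    W, b:      parameters of the lightweight projection, D_dit -> D_vae

    Projects the transformer tokens into the VAE feature space, then
    returns 1 minus the mean per-token cosine similarity, so the loss
    is 0 when the projected features align perfectly with the targets.
    """
    proj = dit_feats @ W + b                                   # (N, T, D_vae)
    p = proj / np.linalg.norm(proj, axis=-1, keepdims=True)    # unit vectors
    v = vae_feats / np.linalg.norm(vae_feats, axis=-1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(p * v, axis=-1)))
```

In training this term would be added, with some weight, to the standard denoising objective; with an identity projection and identical features the loss is exactly zero.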
Problem

Research questions and friction points this paper is trying to address.

diffusion training
training convergence
computational overhead
representation alignment
efficient training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Autoencoder
Representation Alignment
Diffusion Transformer
Efficient Training
Feature Alignment
Mengmeng Wang
Zhejiang University
Computer Vision, Deep Learning
Dengyang Jiang
Northwestern Polytechnical University
Computer Vision, Deep Learning, Machine Learning
Liuzhuozheng Li
SGIT AI Lab, State Grid Corporation of China
Yucheng Lin
Zhejiang University of Technology
Guojiang Shen
Zhejiang University of Technology
Xiangjie Kong
Zhejiang University of Technology
Yong Liu
Institute of Cyber-Systems and Control, Zhejiang University
Robotic Vision and Perception, Graphics, Information Fusion
Guang Dai
SGIT AI Lab, State Grid Corporation of China
Jingdong Wang
Baidu