SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models face an inherent trade-off between training efficiency and generation quality. Existing representation alignment methods (e.g., REPA) focus solely on local semantic alignment, failing to capture structural relationships within visual representations or to enforce global distributional consistency between the encoder and the denoising network. To address this, we propose SARA, the first hierarchical representation alignment framework, which achieves unified local-to-global optimization via three synergistic constraints: (1) patch-wise local semantic alignment; (2) intra-representation structural consistency modeling via self-correlation matrices; and (3) adversarial alignment of encoder and denoiser distributions. On ImageNet-256, SARA achieves an FID of 1.36, converging twice as fast as REPA and outperforming state-of-the-art methods. It is the first approach to systematically integrate multi-granularity representation alignment, establishing a new paradigm for efficient, high-fidelity diffusion model training.

📝 Abstract
Modern diffusion models encounter a fundamental trade-off between training efficiency and generation quality. While existing representation alignment methods, such as REPA, accelerate convergence through patch-wise alignment, they often fail to capture structural relationships within visual representations and ensure global distribution consistency between pretrained encoders and denoising networks. To address these limitations, we introduce SARA, a hierarchical alignment framework that enforces multi-level representation constraints: (1) patch-wise alignment to preserve local semantic details, (2) autocorrelation matrix alignment to maintain structural consistency within representations, and (3) adversarial distribution alignment to mitigate global representation discrepancies. Unlike previous approaches, SARA explicitly models both intra-representation correlations via self-similarity matrices and inter-distribution coherence via adversarial alignment, enabling comprehensive alignment across local and global scales. Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods. This work establishes a systematic paradigm for optimizing diffusion training through hierarchical representation alignment.
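The two non-adversarial constraints described above can be illustrated in a few lines. The sketch below (NumPy, with hypothetical function names; the paper's actual losses, weightings, and projection heads are not specified here) shows a patch-wise cosine alignment term and a self-correlation (self-similarity) matrix alignment term between encoder features `h` and denoiser features `z`. The adversarial distribution term is omitted, since it requires a learned discriminator.

```python
import numpy as np

def patch_alignment_loss(h, z):
    """Patch-wise semantic alignment: mean cosine distance between
    corresponding patch embeddings of shape (num_patches, dim)."""
    hn = h / np.linalg.norm(h, axis=-1, keepdims=True)
    zn = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(hn * zn, axis=-1))

def self_correlation_loss(h, z):
    """Structural consistency: match the (num_patches, num_patches)
    self-similarity matrices of the two representations."""
    hn = h / np.linalg.norm(h, axis=-1, keepdims=True)
    zn = z / np.linalg.norm(z, axis=-1, keepdims=True)
    S_h = hn @ hn.T  # encoder self-correlation matrix
    S_z = zn @ zn.T  # denoiser self-correlation matrix
    return np.mean((S_h - S_z) ** 2)
```

In a training loop these terms would be weighted and added to the standard denoising objective; both are zero when the two representations agree up to per-patch scale.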
Problem

Research questions and friction points this paper is trying to address.

Balancing training efficiency and generation quality in diffusion models.
Capturing structural relationships and global distribution consistency in representations.
Improving convergence speed and image generation quality using hierarchical alignment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical alignment framework for diffusion models
Multi-level representation constraints for training efficiency
Adversarial alignment to reduce global discrepancies
Hesen Chen (Alibaba Group) · Computer Vision
Junyan Wang (Postdoctoral Research Fellow, University of Adelaide) · Deep Learning, Computer Vision, Generative AI
Zhiyu Tan (Fudan University, Shanghai Academy of Artificial Intelligence for Science)
Hao Li (Fudan University, Shanghai Academy of Artificial Intelligence for Science)