🤖 AI Summary
Diffusion Transformers (DiTs) suffer from slow training convergence. Existing representation alignment methods (e.g., REPA) accelerate early-stage training but induce a capacity mismatch, causing performance to saturate or degrade in later stages. To address this, we propose HASTE, a two-phase alignment scheduling framework: it first jointly distills mid-layer attention relationships and semantic features from a teacher model, then abruptly terminates alignment to fully unleash the student's generative capacity. HASTE introduces a multi-objective knowledge distillation loss combining attention map distillation and feature projection anchoring, and requires no architectural modifications. On ImageNet 256×256, HASTE reaches the vanilla baseline FID in only 50 epochs and matches REPA's best FID with a 28× reduction in optimization steps, while also improving text-to-image generation on MS-COCO. Code is publicly available.
📝 Abstract
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g., DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination: once a simple trigger such as a fixed iteration count is hit, the alignment loss is deactivated, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating it to be a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at https://github.com/NUS-HPC-AI-Lab/HASTE .
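The two-phase schedule described in the abstract can be sketched as a simple loss switch: during Phase I the denoising loss is augmented with attention-map and feature-alignment distillation terms; after a one-shot trigger (here a fixed step count), only the denoising loss remains. This is a minimal illustrative sketch with hypothetical names and weights, not the authors' implementation (which is in the linked repository):

```python
def haste_loss(step, denoise_loss, attn_distill_loss, feat_align_loss,
               stop_step=50_000, lam_attn=1.0, lam_feat=0.5):
    """Combine training objectives under a HASTE-style two-phase schedule.

    Phase I  (step < stop_step): denoising + holistic alignment, i.e.
        attention-map distillation and feature-projection alignment
        against the teacher, each with its own weight.
    Phase II (step >= stop_step): one-shot termination -- the alignment
        terms are dropped entirely and only the denoising loss is used.
    All loss arguments are assumed to be precomputed scalars.
    """
    if step < stop_step:
        # Phase I: holistic alignment is active.
        return denoise_loss + lam_attn * attn_distill_loss + lam_feat * feat_align_loss
    # Phase II: the student trains on denoising alone.
    return denoise_loss
```

The trigger here is a fixed iteration count purely for illustration; the abstract notes only that "a simple trigger such as a fixed iteration" ends Phase I, and other criteria could be substituted.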