🤖 AI Summary
Diffusion Transformers (DiTs) suffer from slow training convergence. Existing representation alignment methods (e.g., REPA) accelerate early-stage training but induce a capacity mismatch, causing performance to saturate or degrade in later stages. To address this, we propose HASTE, a two-phase alignment scheduling framework: it first jointly distills mid-layer attention relationships and semantic features from a teacher model, then abruptly terminates alignment to fully unleash the student's generative capacity. HASTE introduces a multi-objective knowledge distillation loss combining attention map distillation and feature projection anchoring, and requires no architectural modifications. On ImageNet 256×256, HASTE reaches the vanilla baseline FID in only 50 epochs and matches REPA's best FID with a 28× reduction in optimization steps, while also improving text-to-image generation on MS-COCO. Code is publicly available.
📝 Abstract
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g., DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination: once a simple trigger such as a fixed iteration count is hit, the alignment loss is deactivated, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating it to be a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at https://github.com/NUS-HPC-AI-Lab/HASTE .
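The two-phase schedule described in the abstract can be sketched as a simple loss switch: during Phase I the denoising loss is augmented with attention-map and feature-alignment distillation terms; after a one-shot trigger (here a fixed step count), only the denoising loss remains. This is a minimal illustrative sketch with hypothetical names and weights, not the authors' implementation (which is in the linked repository):

```python
def haste_loss(step, denoise_loss, attn_distill_loss, feat_align_loss,
               stop_step=50_000, lam_attn=1.0, lam_feat=0.5):
    """Combine training objectives under a HASTE-style two-phase schedule.

    Phase I  (step < stop_step): denoising + holistic alignment, i.e.
        attention-map distillation and feature-projection alignment
        against the teacher, each with its own weight.
    Phase II (step >= stop_step): one-shot termination -- the alignment
        terms are dropped entirely and only the denoising loss is used.
    All loss arguments are assumed to be precomputed scalars.
    """
    if step < stop_step:
        # Phase I: holistic alignment is active.
        return denoise_loss + lam_attn * attn_distill_loss + lam_feat * feat_align_loss
    # Phase II: the student trains on denoising alone.
    return denoise_loss
```

The trigger here is a fixed iteration count purely for illustration; the abstract notes only that "a simple trigger such as a fixed iteration" ends Phase I, and other criteria could be substituted.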