Rethinking Cross-Layer Information Routing in Diffusion Transformers

📅 2026-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work examines the limitations of conventional residual connections in Diffusion Transformers, which exhibit monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. The study introduces Diffusion-Adaptive Routing (DAR), a learnable drop-in replacement for the residual connection that aggregates the history of sublayer outputs, treating cross-layer information routing as a design dimension in its own right. DAR is adaptive to the denoising timestep and operates in a non-incremental manner. It is compatible with modern Transformer enhancements such as REPA. On ImageNet at 256×256 resolution, DAR improves the FID of SiT-XL/2 from 9.67 to 7.56 (a reduction of 2.11; lower is better) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage.
📝 Abstract
Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
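The abstract does not spell out how the aggregation is computed, so the following is only a rough sketch of what "learnable, timestep-adaptive, and non-incremental" routing could look like, not the authors' implementation. It assumes (hypothetically) that each layer's input is a softmax-weighted mixture of all earlier sublayer outputs, with per-source logits that depend linearly on the timestep. The names `route`, `residual_sum`, `w_static`, and `w_time` are illustrative and do not come from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def residual_sum(history):
    """Standard residual stream: an incremental running sum of sublayer outputs."""
    dim = len(history[0])
    return [sum(h[d] for h in history) for d in range(dim)]

def route(history, t, w_static, w_time):
    """Hypothetical routing step (an assumption, not the paper's formula).

    history  : list of earlier sublayer outputs, each a list of floats
    t        : denoising timestep, assumed scaled to [0, 1]
    w_static : assumed learnable per-source bias logits
    w_time   : assumed learnable per-source slopes with respect to t
    """
    # Timestep-adaptive: the mixing logits change with t.
    logits = [w_static[i] + w_time[i] * t for i in range(len(history))]
    weights = softmax(logits)
    # Non-incremental: the output is recomputed from the whole history
    # rather than added onto a running sum.
    dim = len(history[0])
    mixed = [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]
    return mixed, weights

# Toy usage with three earlier sublayer outputs of dimension 2.
history = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
w_static = [0.0, 0.5, -0.5]
w_time = [1.0, 0.0, -1.0]
early, early_weights = route(history, 0.0, w_static, w_time)
late, late_weights = route(history, 1.0, w_static, w_time)
print("weights at t=0:", early_weights)
print("weights at t=1:", late_weights)
print("plain residual sum:", residual_sum(history))
```

Under this assumed softmax form the output is a convex combination of the history, so its magnitude cannot exceed that of the largest stored output, whereas the plain residual sum can keep growing with depth. Whether DAR bounds magnitudes in this particular way is not stated in the abstract.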
Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers
residual stream
cross-layer information flow
information routing
denoising timestep
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformers
Cross-layer Routing
Residual Stream
Timestep-adaptive Aggregation
Diffusion-Adaptive Routing
Chao Xu
Nanjing University, Alibaba Group
Maohua Li
Hohai University
Spiking Neural Networks
Qirui Li
Alibaba Group, Zhejiang University
Yixuan Xu
Alibaba Group
Yanke Zhou
Nanjing University, Alibaba Group
Yunhe Li
Alibaba Group, City University of Hong Kong
Cuifeng Shen
Alibaba Group
Hanlin Tang
Alibaba Group
Kan Liu
Alibaba Group
Tao Lan
Alibaba Group
Lin Qu
Alibaba Group
Shao-Qun Zhang
Nanjing University