🤖 AI Summary
This work shows that Diffusion Transformers (DiTs) for image generation can produce a small number of high-norm outlier tokens, both in the pretrained ViT encoder and in the denoiser itself, and that these tokens are associated with degraded generation quality. Simply masking the high-norm tokens does not improve performance, which suggests the problem is tied to corrupted local patch semantics rather than only to a few extreme values. To address this, the authors propose Dual-Stage Registers (DSR), a register-based intervention covering both components: trained registers for the encoder when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. On ImageNet and large-scale text-to-image generation, DSR consistently reduces outlier artifacts and improves generation quality.
📝 Abstract
We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
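To make the notion of a "high-norm token" concrete, here is a minimal sketch of how such tokens might be flagged in a sequence of patch embeddings. The median-plus-MAD rule, the multiplier `k`, and the function names are illustrative assumptions and are not taken from the paper, which does not state its detection criterion in this abstract.

```python
import math
import statistics


def token_norms(tokens):
    """Return the L2 norm of each token (a token is a list of floats)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_outlier_tokens(tokens, k=5.0):
    """Return indices of tokens whose norm exceeds median + k * MAD.

    Assumption: the median/MAD threshold and k=5.0 are hypothetical
    choices for illustration, not the paper's criterion.
    """
    norms = token_norms(tokens)
    med = statistics.median(norms)
    # Median absolute deviation: a robust estimate of the spread of norms.
    mad = statistics.median([abs(n - med) for n in norms])
    threshold = med + k * max(mad, 1e-12)
    return [i for i, n in enumerate(norms) if n > threshold]


# Toy input: 8 patch tokens of dimension 4, with token 5 given an inflated norm.
tokens = [[0.1 * (i + 1), -0.2, 0.3, 0.1] for i in range(8)]
tokens[5] = [12.0, -9.0, 15.0, 7.0]
outliers = find_outlier_tokens(tokens)
```

A detector like this only locates the tokens. As the abstract notes, masking them is not sufficient on its own, which is the motivation for the register-based interventions.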