🤖 AI Summary
This work investigates the role of register tokens in pixel-space Diffusion Transformers (DiTs). Unlike Vision Transformers (ViTs), DiTs do not exhibit high-norm patch-token outliers, yet register tokens still significantly improve their convergence and generation quality, possibly because they yield cleaner intermediate feature maps at high noise levels. Building on this, the authors investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves generation quality with negligible runtime overhead. The study also observes that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance.
📝 Abstract
Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by register tokens. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.
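For readers unfamiliar with the basic mechanism, the following is a minimal sketch of how register tokens are commonly used in a transformer: extra tokens are appended to the patch sequence, take part in attention, and are discarded at the output. It is not the paper's implementation and does not depict the dual-stream architecture, whose details the abstract does not give. The function names (`self_attention`, `forward_with_registers`), the identity Q/K/V projections, and the fixed register values are all assumptions made for illustration; in a real model the registers would be learnable parameters.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head attention with identity Q/K/V projections (assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Attention output is the weighted average of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def forward_with_registers(patch_tokens, register_tokens, num_layers=2):
    """Append registers, run residual attention layers, return patch tokens only."""
    num_patches = len(patch_tokens)
    seq = patch_tokens + register_tokens  # registers sit after the patches
    for _ in range(num_layers):
        attended = self_attention(seq)
        # Residual connection around the attention layer.
        seq = [[x + y for x, y in zip(tok, att)] for tok, att in zip(seq, attended)]
    return seq[:num_patches]  # registers are dropped at the output


# Hypothetical inputs: 4 patch tokens and 2 register tokens of width 3.
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
registers = [[0.5, 0.5, 0.5], [-1.0, 0.0, 1.0]]

with_regs = forward_with_registers(patches, registers)
without_regs = forward_with_registers(patches, [])
```

The output has one vector per patch token in both cases, so the registers add no outputs of their own. They influence the result only through attention, which is why `with_regs` and `without_regs` differ even though the patch inputs are identical.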