🤖 AI Summary
Diffusion Transformers (DiTs) suffer from positional encoding extrapolation failure when the inference resolution differs from the training resolution, leading to severe degradation in high-resolution generation. To address this, we propose the first length-extrapolatable DiT architecture that eliminates explicit positional encodings entirely: it leverages causal attention to implicitly model global sequence ordering and incorporates locality enhancement to distinguish nearby tokens. This design circumvents extrapolation bottlenecks inherent in explicit encodings (e.g., RoPE) and requires only lightweight fine-tuning (100K steps). On ImageNet, our model achieves successful resolution extrapolation—from 256×256 to 512×512 and from 512×512 to 1024×1024—outperforming state-of-the-art methods including NTK-aware and YaRN in both FID and CLIP Score. To our knowledge, this is the first demonstration of high-fidelity resolution extrapolation in diffusion models using a positional-encoding-free architecture.
📝 Abstract
Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolutions. The primary obstacle is that explicit positional encodings (PEs), such as RoPE, require extrapolation, which degrades performance when the inference resolution differs from the training resolution. In this paper, we propose the Length-Extrapolatable Diffusion Transformer (LEDiT), a simple yet powerful architecture that overcomes this limitation. LEDiT needs no explicit PEs, thereby avoiding extrapolation. The key innovations of LEDiT are introducing causal attention to implicitly impart global positional information to tokens, while enhancing locality to precisely distinguish adjacent tokens. Experiments on 256×256 and 512×512 ImageNet show that LEDiT can scale the inference resolution to 512×512 and 1024×1024, respectively, while achieving better image quality than current state-of-the-art length extrapolation methods (NTK-aware, YaRN). Moreover, LEDiT achieves strong extrapolation performance with just 100K steps of fine-tuning on a pretrained DiT, demonstrating its potential for integration into existing text-to-image DiTs.
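The core claim, that causal attention can carry positional information without any explicit PE, can be illustrated with a minimal numpy sketch (our own toy example, not the paper's implementation; single head, q = k = v for brevity). Bidirectional attention without PEs is permutation-equivariant, so it cannot tell token positions apart; adding a causal mask breaks that symmetry, which is what lets order be encoded implicitly:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, causal):
    # Toy single-head self-attention with NO positional encoding.
    # q = k = v = x purely for illustration.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if causal:
        n = x.shape[0]
        upper = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(upper, -np.inf, scores)  # mask future tokens
    return softmax(scores) @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
perm = np.array([2, 0, 5, 1, 4, 3])  # a fixed non-identity permutation

# Without a mask, permuting the input just permutes the output:
# the model is blind to token order.
full = attention(x, causal=False)
assert np.allclose(attention(x[perm], causal=False), full[perm])

# With the causal mask, the same check fails: each token's output
# depends on its position in the sequence, i.e. order is encoded
# implicitly, with no RoPE or learned PE involved.
causal_out = attention(x, causal=True)
assert not np.allclose(attention(x[perm], causal=True), causal_out[perm])
```

This only demonstrates the permutation-symmetry argument; LEDiT's actual architecture additionally includes the locality-enhancement component described above.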