🤖 AI Summary
This work addresses the degraded generalization of diffusion Transformers in cross-resolution image generation, caused by positional encoding mismatch across resolutions. We propose a novel paradigm enabling arbitrary-scale inference from single-resolution training. Our method introduces (1) a two-dimensional randomized positional encoding (RPE-2D), which models relative ordering—not absolute distances—among image patches; and (2) a synergistic combination of random cropping augmentation and micro-condition-aware modulation to decouple the training resolution from the inference resolution. Trained exclusively on ImageNet at 256×256 resolution, our model achieves state-of-the-art performance at test resolutions of 384×384 and 512×512. It further supports super-resolution scaling up to 768×768 and 1024×1024 (from 512×512), while significantly improving low-resolution generation fidelity and reducing multi-stage training overhead.
📝 Abstract
Resolution generalization in image generation enables the production of higher-resolution images at a lower training-resolution cost. However, a significant challenge in resolution generalization, particularly for the widely used Diffusion Transformers, lies in the mismatch between the positional encodings encountered during testing and those seen during training. While existing methods have employed techniques such as interpolation, extrapolation, or combinations of the two, none have fully resolved this issue. In this paper, we propose a novel two-dimensional randomized positional encodings (RPE-2D) framework that focuses on learning the positional order of image patches rather than the specific distances between them, enabling seamless high- and low-resolution image generation without training at those resolutions. Specifically, RPE-2D independently selects positions over a broader range along both the horizontal and vertical axes, ensuring that all positional encodings encountered at inference time have already been trained, thus improving resolution generalization. Additionally, we propose a random data augmentation technique to enhance the modeling of positional order. To address the image cropping caused by this augmentation, we introduce corresponding micro-conditioning that enables the model to perceive the specific cropping patterns. On the ImageNet dataset, our proposed RPE-2D achieves state-of-the-art resolution generalization performance, outperforming existing competitive methods when trained at a resolution of $256\times 256$ and inferred at $384\times 384$ and $512\times 512$, as well as when scaling from $512\times 512$ to $768\times 768$ and $1024\times 1024$. It also exhibits strong capabilities in low-resolution image generation, multi-stage training acceleration, and multi-resolution inheritance.
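The core mechanism the abstract describes—independently sampling positions over a broader range along each axis so that inference-time positions are already covered by training—can be sketched as follows. This is a hypothetical illustration based only on the abstract's description, not the authors' implementation; the function name, the extended-range parameters `max_h`/`max_w`, and the meshgrid layout are all assumptions.

```python
import numpy as np

def sample_rpe_2d_positions(h, w, max_h, max_w, rng=None):
    """Hypothetical RPE-2D sketch: draw sorted random row/column indices
    from an extended range [0, max_h) x [0, max_w).

    During training on an h x w patch grid, positions are sampled from a
    range larger than the grid itself and then sorted, so the model learns
    the relative ordering of patches rather than fixed absolute distances.
    At a higher inference resolution, the (larger) patch grid can use
    consecutive indices from the same range, all of which were seen in
    training.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Independent sampling per axis, sorted to preserve spatial order.
    rows = np.sort(rng.choice(max_h, size=h, replace=False))
    cols = np.sort(rng.choice(max_w, size=w, replace=False))
    # Patch (i, j) receives the 2D positional index (rows[i], cols[j]).
    grid = np.stack(np.meshgrid(rows, cols, indexing="ij"), axis=-1)
    return grid  # shape (h, w, 2)

# Example: a 4x4 patch grid with positions drawn from a 16x16 index range,
# emulating training at a low resolution against a broader position space.
pos = sample_rpe_2d_positions(4, 4, 16, 16, rng=np.random.default_rng(0))
```

In this reading, resolution generalization follows because an inference grid of up to 16×16 patches uses only indices the model has (stochastically) encountered during training.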