UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Image diffusion Transformers suffer from content repetition and quality degradation when extrapolating to extreme resolutions (e.g., beyond 4K). This paper presents the first systematic framework tailored for such high-fidelity extrapolation. First, a frequency-domain analysis of positional embeddings reveals that repetition stems from the periodicity of the dominant frequency, whose period aligns with the training resolution. Second, a recursive dominant-frequency correction module constrains that frequency within a single period after extrapolation, suppressing periodic artifacts. Third, an entropy-guided adaptive attention concentration mechanism dynamically balances local detail reconstruction with global structural consistency, augmented by attention sparsification for computational efficiency. Evaluated on Qwen-Image and Flux architectures, the method enables generation of 6K×6K images without low-resolution guidance, achieving state-of-the-art improvements: significantly reduced repetition rates, enhanced visual fidelity, and superior structural coherence.

📝 Abstract
Recent image diffusion transformers achieve high-fidelity generation at their training resolutions, but struggle to generate images beyond these scales, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K×6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page: https://thu-ml.github.io/ultraimage.github.io/
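The entropy-guided attention concentration described above can be pictured roughly as follows: heads with diluted (high-entropy) attention maps get a larger focus factor that sharpens the softmax, while heads with already-peaked global maps stay near the baseline. This is a minimal NumPy sketch; the function name, shapes, and the linear entropy-to-focus mapping are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def concentrated_attention(q, k, v, tau_min=1.0, tau_max=2.0):
    """Entropy-guided attention concentration (illustrative sketch).

    q, k, v: arrays of shape (heads, seq, dim). Heads whose attention
    maps are diluted (high entropy) receive a focus factor near tau_max;
    heads with already-concentrated maps stay near tau_min.
    """
    d = q.shape[-1]
    logits = q @ k.swapaxes(-2, -1) / np.sqrt(d)       # (H, N, N)
    probs = softmax(logits)

    # Mean entropy per head, normalized by log(N) into [0, 1].
    ent = -(probs * np.log(np.clip(probs, 1e-9, None))).sum(-1).mean(-1)
    ent_norm = ent / np.log(probs.shape[-1])           # (H,)

    # Higher entropy -> larger focus factor -> sharper re-softmax.
    focus = tau_min + (tau_max - tau_min) * ent_norm
    sharpened = softmax(logits * focus[:, None, None])
    return sharpened @ v
```

Scaling the logits by a per-head factor before the softmax is the standard way to sharpen or flatten an attention distribution; here the factor is simply tied to each head's measured entropy.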
Problem

Research questions and friction points this paper is trying to address.

Addresses content repetition in image diffusion transformers during resolution extrapolation
Solves quality degradation from diluted attention in high-resolution image generation
Enables extreme resolution extrapolation beyond training scales without low-resolution guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recursive dominant frequency correction for resolution extrapolation
Entropy-guided adaptive attention concentration for detail enhancement
Generating high-resolution images up to 6K without low-resolution guidance
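The recursive dominant frequency correction is not specified here beyond the abstract, but its core idea, keeping the dominant positional-embedding frequency within a single period at the extrapolated resolution, can be sketched as a simple rescaling. The function name and the dominant-frequency selection rule below are assumptions, and the paper's recursive scheme may differ.

```python
import numpy as np

def correct_dominant_frequency(freqs, train_len, target_len):
    """Rescale positional-embedding frequencies that would wrap past one
    period at the extrapolated length (illustrative sketch).

    freqs: 1-D array of per-dimension angular frequencies theta_i, as in
    RoPE-style embeddings where dimension i rotates by theta_i per token.
    Any frequency whose period (2*pi / theta_i) reaches the training
    length is treated as 'dominant' and shrunk so that one full period
    now spans the target length, avoiding periodic repetition.
    """
    freqs = np.asarray(freqs, dtype=float).copy()
    periods = 2 * np.pi / freqs
    dominant = periods >= train_len            # would repeat when extrapolating
    freqs[dominant] *= train_len / target_len  # stretch period by target/train
    return freqs
```

For example, with a training length of 1000 and a target of 4000, a frequency whose period is 2000 tokens would repeat twice across the extrapolated image; the rescaling stretches its period to 8000 so it stays within a single cycle.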
Authors

Min Zhao (Dept. of Comp. Sci. & Tech., BNRist Center, Tsinghua University)
Bokai Yan (Gaoling School of Artificial Intelligence, Renmin University of China)
Xue Yang (Dept. of Comp. Sci. & Tech., BNRist Center, Tsinghua University)
Hongzhou Zhu (Tsinghua University)
Jintao Zhang (Tsinghua University)
Shilong Liu (RS@ByteDance, PhD@THU)
Chongxuan Li (Associate Professor, Renmin University of China)
Jun Zhu (Dept. of Comp. Sci. & Tech., BNRist Center, Tsinghua University)