🤖 AI Summary
Existing diffusion models face two key limitations for 4K image generation: the quadratic computational complexity of standard self-attention and the absence of native 4K training data. Together, these prevent models from achieving fine-grained texture fidelity and global structural coherence at the same time. To address this, we propose a hierarchical local attention mechanism that integrates window-based computation, Hilbert-curve token reordering, scaled positional anchoring, and lightweight LoRA adapters. Our method enables localized detail modeling and semantic alignment under low-resolution global guidance, without requiring any 4K training data. It substantially reduces GPU memory consumption and accelerates inference by more than 2× compared to dense attention. Quantitatively, our approach matches or surpasses state-of-the-art methods trained on 4K data across FID, Inception Score (IS), and CLIP Score. This work breaks longstanding computational and performance bottlenecks in high-resolution generative modeling.
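The summary's central efficiency claim is that restricting attention to fixed-size windows drops the cost from quadratic to near-linear in sequence length. The paper's actual kernel is not reproduced here; the following is a minimal NumPy sketch (with queries, keys, and values all taken to be the same tokens for brevity) that illustrates why: each window of size $w$ costs $O(w^2 d)$, so the total is $O(L \cdot w \cdot d)$ rather than $O(L^2 d)$.

```python
import numpy as np

def windowed_attention(x, window):
    """Self-attention restricted to non-overlapping fixed-size windows.

    x: array of shape (seq_len, dim); seq_len must be divisible by window.
    Illustrative simplification: q = k = v = x (no learned projections).
    Cost per window is O(window^2 * dim), so the whole sequence costs
    O(seq_len * window * dim) instead of O(seq_len^2 * dim) for dense attention.
    """
    seq_len, dim = x.shape
    assert seq_len % window == 0, "pad the sequence to a multiple of the window"
    out = np.empty_like(x)
    for start in range(0, seq_len, window):
        w = x[start:start + window]                    # (window, dim)
        scores = w @ w.T / np.sqrt(dim)                # (window, window)
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)       # rows sum to 1
        out[start:start + window] = attn @ w           # convex mix of window tokens
    return out
```

Because each output row is a convex combination of tokens from its own window only, information never crosses window boundaries here; in Scale-DiT that cross-window coherence is supplied by the low-resolution global pathway instead.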
📝 Abstract
Ultra-high-resolution text-to-image generation demands both fine-grained texture synthesis and globally coherent structure, yet current diffusion models remain constrained to sub-$1K\times 1K$ resolutions due to the prohibitive quadratic complexity of attention and the scarcity of native $4K$ training data. We present \textbf{Scale-DiT}, a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance, enabling efficient, scalable, and semantically coherent image synthesis at ultra-high resolutions. Specifically, high-resolution latents are divided into fixed-size local windows to reduce attention complexity from quadratic to near-linear, while a low-resolution latent equipped with scaled positional anchors injects global semantics. A lightweight LoRA adaptation bridges the global and local pathways during denoising, ensuring consistency across structure and detail. To maximize inference efficiency, we reorder the token sequence along a Hilbert curve and implement a fused kernel that skips masked operations, resulting in a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT achieves more than $2\times$ faster inference and lower memory usage compared to dense attention baselines, while reliably scaling to $4K\times 4K$ resolution without requiring additional high-resolution training data. On both quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons, Scale-DiT delivers superior global coherence and sharper local detail, matching or outperforming state-of-the-art methods that rely on native 4K training. Taken together, these results highlight hierarchical local attention with guided low-resolution anchors as a promising and effective approach for advancing ultra-high-resolution image generation.