🤖 AI Summary
Existing diffusion models face two key limitations for 4K image generation: the quadratic computational complexity of standard self-attention and the absence of native 4K training data. Together, these prevent models from achieving fine-grained texture fidelity and global structural coherence at the same time. To address this, we propose a hierarchical local attention mechanism that integrates window-based computation, Hilbert-curve token reordering, scaled positional anchoring, and lightweight LoRA adapters. Our method enables localized detail modeling and semantic alignment under low-resolution global guidance, without requiring any 4K training data. It substantially reduces GPU memory consumption and accelerates inference by more than 2× compared to dense attention. Quantitatively, our approach matches or surpasses state-of-the-art methods trained on 4K data across FID, Inception Score (IS), and CLIP Score. This work breaks longstanding computational and performance bottlenecks in high-resolution generative modeling.
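The summary's central efficiency claim is that restricting attention to fixed-size windows drops the cost from quadratic to near-linear in sequence length. The paper's actual kernel is not reproduced here; the following is a minimal NumPy sketch (with queries, keys, and values all taken to be the same tokens for brevity) that illustrates why: each window of size $w$ costs $O(w^2 d)$, so the total is $O(L \cdot w \cdot d)$ rather than $O(L^2 d)$.

```python
import numpy as np

def windowed_attention(x, window):
    """Self-attention restricted to non-overlapping fixed-size windows.

    x: array of shape (seq_len, dim); seq_len must be divisible by window.
    Illustrative simplification: q = k = v = x (no learned projections).
    Cost per window is O(window^2 * dim), so the whole sequence costs
    O(seq_len * window * dim) instead of O(seq_len^2 * dim) for dense attention.
    """
    seq_len, dim = x.shape
    assert seq_len % window == 0, "pad the sequence to a multiple of the window"
    out = np.empty_like(x)
    for start in range(0, seq_len, window):
        w = x[start:start + window]                    # (window, dim)
        scores = w @ w.T / np.sqrt(dim)                # (window, window)
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)       # rows sum to 1
        out[start:start + window] = attn @ w           # convex mix of window tokens
    return out
```

Because each output row is a convex combination of tokens from its own window only, information never crosses window boundaries here; in Scale-DiT that cross-window coherence is supplied by the low-resolution global pathway instead.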
📝 Abstract
Ultra-high-resolution text-to-image generation demands both fine-grained texture synthesis and globally coherent structure, yet current diffusion models remain constrained to sub-$1K\times 1K$ resolutions due to the prohibitive quadratic complexity of attention and the scarcity of native $4K$ training data. We present \textbf{Scale-DiT}, a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance, enabling efficient, scalable, and semantically coherent image synthesis at ultra-high resolutions. Specifically, high-resolution latents are divided into fixed-size local windows to reduce attention complexity from quadratic to near-linear, while a low-resolution latent equipped with scaled positional anchors injects global semantics. A lightweight LoRA adaptation bridges the global and local pathways during denoising, ensuring consistency across structure and detail. To maximize inference efficiency, we reorder the token sequence along a Hilbert curve and implement a fused kernel that skips masked operations, resulting in a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT achieves more than $2\times$ faster inference and lower memory usage compared to dense attention baselines, while reliably scaling to $4K\times 4K$ resolution without requiring additional high-resolution training data. On both quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons, Scale-DiT delivers superior global coherence and sharper local detail, matching or outperforming state-of-the-art methods that rely on native 4K training. Taken together, these results highlight hierarchical local attention with guided low-resolution anchors as a promising and effective approach for advancing ultra-high-resolution image generation.