🤖 AI Summary
Diffusion Transformers (DiTs) degrade when generating images beyond their training resolution, and existing training-free, test-time methods struggle to preserve global structure and fine detail at the same time. This work proposes SEGA, a training-free method that modulates attention at inference time: guided by the spatial spectral energy of the latent at each denoising step, it adaptively sets the attention scaling applied to the individual frequency components of Rotary Position Embedding (RoPE). By replacing uniform, content-agnostic scaling, the method eases the trade-off between structural coherence and detail fidelity, and it consistently outperforms state-of-the-art training-free baselines across multiple target resolutions.
📝 Abstract
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embedding (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform, content-agnostic scaling across RoPE components that have distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
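The abstract does not give SEGA's actual formulas, so the following is only a minimal NumPy sketch of the general idea of energy-guided, per-component scaling, not the authors' implementation. The function names (`band_energy`, `adaptive_scales`, `rope_pair_scales`), the radial band split, the linear energy-to-scale mapping, and the pairing of RoPE frequency pairs with spatial-frequency bands are all assumptions made for illustration.

```python
import numpy as np

def band_energy(latent, num_bands):
    """Fraction of spectral energy per radial frequency band of a (C, H, W) latent."""
    _, h, w = latent.shape
    spec = np.fft.fft2(latent, axes=(-2, -1))
    power = (np.abs(spec) ** 2).sum(axis=0)           # (H, W), summed over channels
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)               # distance from the DC component
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    idx = np.clip(np.digitize(radius, edges[1:-1]), 0, num_bands - 1)
    energy = np.bincount(idx.ravel(), weights=power.ravel(), minlength=num_bands)
    total = energy.sum()
    if total == 0:                                    # all-zero latent: fall back to uniform
        return np.full(num_bands, 1.0 / num_bands)
    return energy / total

def adaptive_scales(energy, base_scale):
    """Hypothetical mapping: redistribute a uniform scale in proportion to band energy.

    With equal energy in every band this reduces to uniform scaling by base_scale.
    """
    weights = energy / energy.mean()
    return 1.0 + (base_scale - 1.0) * weights

def rope_pair_scales(band_scales, head_dim):
    """Assumed assignment of RoPE rotation pairs to bands (pair 0 = highest frequency)."""
    n_pairs = head_dim // 2
    n_bands = len(band_scales)
    pair_idx = np.arange(n_pairs)
    band_of_pair = (n_bands - 1) - (pair_idx * n_bands) // n_pairs
    return np.repeat(band_scales[band_of_pair], 2)    # one scale per query dimension

# Usage at a single denoising step, with a random stand-in for the latent.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))
scales = adaptive_scales(band_energy(latent, num_bands=4), base_scale=1.2)
q_scale = rope_pair_scales(scales, head_dim=64)       # multiply queries by this vector
```

In this sketch the scales are recomputed from the current latent at every step, which is what makes the scaling content-dependent. A uniform baseline corresponds to passing equal band energies, in which case every component receives `base_scale`.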