🤖 AI Summary
This work addresses remote sensing image synthesis, which is hindered by the lack of a domain-specific generative prior and the high computational cost of training at high resolutions. Existing training-free resolution-promotion methods rescale Rotary Position Embedding (RoPE) with a static rule throughout denoising, which degrades the mid- and high-frequency details that remote sensing imagery depends on. To overcome this, the authors propose SHARP in two parts. First, they fine-tune FLUX on over 100,000 curated remote sensing images to obtain a domain-specific prior, RS-FLUX. Second, they introduce a spectrum-aware rational fractional time schedule \( k_{\text{rs}}(t) \) into RoPE, so that positional scaling follows the denoising process: it is strong in the early stage, when the structural layout forms, and is progressively relaxed later, when fine details are recovered. Without additional training, SHARP supports multi-scale high-resolution generation from a single set of hyperparameters and outperforms existing training-free baselines across six square and rectangular resolutions on CLIP Score, Aesthetic Score, and HPSv2, with margins that widen at larger extrapolation factors and with negligible computational overhead.
📝 Abstract
Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.
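The idea of a time-dependent positional scale can be illustrated with a small sketch. The abstract does not give the form of k_rs(t), so everything below is an assumption made for illustration: the function `k_schedule`, its shape parameter `c`, the convention that t = 1 is the noisy start and t = 0 the clean end of denoising, and the choice to apply the scale by dividing positions before computing rotary angles. It is not the authors' formula or implementation; the repository linked above is the reference for that.

```python
def k_schedule(t, s, c=0.5):
    """Hypothetical rational schedule (NOT the paper's k_rs).

    t: denoising time in [0, 1]; assumed 1 = pure noise (early), 0 = clean (late).
    s: extrapolation factor, e.g. target_size / training_size.
    c: assumed shape parameter (> 0) controlling how quickly the scale relaxes.
    Returns s at t = 1 and 1 at t = 0, decreasing monotonically in between.
    """
    return 1.0 + (s - 1.0) * t / (t + c * (1.0 - t))


def rope_angles(positions, dim, k, base=10000.0):
    """Rotary angles for 1-D positions, with positions compressed by factor k."""
    inv_freq = [base ** (-i / dim) for i in range(0, dim, 2)]
    return [[(p / k) * f for f in inv_freq] for p in positions]


if __name__ == "__main__":
    s = 2.0  # e.g. generating at twice the training resolution
    for t in (1.0, 0.75, 0.5, 0.25, 0.0):
        k = k_schedule(t, s)
        angles = rope_angles(range(4), dim=8, k=k)
        print(f"t={t:.2f}  k={k:.3f}  angle(pos=3, lowest index)={angles[3][0]:.3f}")
```

A static rule corresponds to holding `k` at `s` for every step. In this sketch the scale instead starts at `s` and decays toward 1, so later steps see positions closer to their unscaled values.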