🤖 AI Summary
Existing diffusion-based 4K super-resolution methods are constrained by memory and therefore rely on patch-based inference, which sacrifices global context, causes semantic confusion and spatial inconsistency, and incurs high latency. This work proposes OP4KSR, a one-step, patch-free 4K super-resolution method that combines a Flux backbone with an F16 variational autoencoder to generate a full 4096×4096 image in a single forward pass, taking 5.75 seconds on a single NVIDIA H20 GPU. The one-step adaptation triggers severe periodic artifacts, which the authors attribute to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity. To suppress them, the approach couples Rotary Position Embedding (RoPE) base frequency rescaling (RFR) with an autocorrelation-based periodicity loss (ℒ_AP). The method achieves competitive perceptual quality with efficient inference, and the work also contributes a dedicated training dataset and three evaluation benchmarks (one synthetic, two real-world).
📝 Abstract
Diffusion-based real-world image super-resolution (Real-ISR) has achieved remarkable perceptual quality; however, directly super-resolving images to 4K remains limited by extreme memory consumption. Consequently, prior methods adopt patch-based inference, sacrificing global context and introducing semantic confusion, spatial inconsistency, and severe latency. We propose OP4KSR, a one-step patch-free 4K SR approach built upon the powerful Flux backbone. By leveraging the extreme-compression F16 VAE, OP4KSR makes 4K SR inference tractable under practical GPU budgets, preserving global spatial-semantic coherence while enabling highly efficient inference. However, adapting this one-step architecture intrinsically triggers severe periodic artifacts. We trace this to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity, both exacerbated by the lack of iterative refinement. To suppress these artifacts, we couple RoPE base frequency rescaling (RFR) with an autocorrelation-based periodicity loss ($\mathcal{L}_\text{AP}$). Furthermore, we curate a dedicated training dataset alongside three benchmarks (one synthetic and two real-world) to advance 4K SR research. Extensive experiments demonstrate that OP4KSR achieves competitive perceptual quality with efficient inference, generating a $4096\times4096$ output in only 5.75 seconds on a single NVIDIA H20 GPU.
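To give a feel for what an autocorrelation-based periodicity measure responds to, the following is a minimal NumPy sketch. It is not the paper's $\mathcal{L}_\text{AP}$: the abstract does not give the formulation, so the function name `periodicity_score`, the `min_lag` parameter, and the choice of "strongest off-center autocorrelation peak" as the score are all assumptions made for illustration. A training loss would additionally need to be differentiable and would probably high-pass the image first, since natural images are strongly self-correlated at small lags.

```python
import numpy as np


def periodicity_score(img: np.ndarray, min_lag: int = 2) -> float:
    """Hypothetical helper: strongest off-center peak of the normalized
    circular autocorrelation of a 2D array (values near 1 mean the image
    repeats itself almost exactly at some nonzero shift)."""
    x = img.astype(np.float64)
    x = x - x.mean()  # remove the DC component

    # Wiener-Khinchin theorem: circular autocorrelation = IFFT(|FFT|^2).
    power = np.abs(np.fft.fft2(x)) ** 2
    ac = np.fft.ifft2(power).real
    ac = ac / (ac[0, 0] + 1e-12)  # normalize so the zero-lag value is 1

    # Move zero lag to the center, then mask a small window around it so
    # the trivial peak at zero shift is ignored.
    ac = np.fft.fftshift(ac)
    cy, cx = ac.shape[0] // 2, ac.shape[1] // 2
    ac[cy - min_lag: cy + min_lag + 1, cx - min_lag: cx + min_lag + 1] = -np.inf
    return float(ac.max())


# Toy inputs (assumed for illustration): vertical stripes with a 16-pixel
# period versus white Gaussian noise with no repeating structure.
stripes = np.sin(2 * np.pi * np.arange(128) / 16)[None, :] * np.ones((128, 1))
noise = np.random.default_rng(0).standard_normal((128, 128))

stripe_score = periodicity_score(stripes)
noise_score = periodicity_score(noise)
```

On these toy inputs the striped image scores far higher than the noise image, which is the behavior a periodicity penalty would exploit: grid-like artifacts produce strong secondary autocorrelation peaks, while artifact-free texture does not.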