🤖 AI Summary
This work addresses structural distortions, such as object duplication and spatial fragmentation, that commonly arise when conventional image diffusion models generate ultra-high-resolution images at extreme aspect ratios (EAR). To ensure long-range structural coherence, the authors reformulate EAR image generation as a continuous video synthesis task, leveraging the temporal consistency inherent in video diffusion models. They introduce Scanning Positional Encoding (ScanPE), which distributes global coordinates across frames to emulate a moving camera, and Scrolling Super-Resolution (ScrollSR), which uses video super-resolution priors to circumvent memory constraints and enables 32K-resolution generation. Starting from a pretrained video diffusion model and fine-tuning on a curated 3K multi-aspect-ratio image dataset, the proposed method is reported to significantly outperform existing image diffusion baselines, suppressing local artifacts while achieving high global consistency and visual fidelity across diverse scenes.
📝 Abstract
While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
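The idea of distributing global coordinates across frames can be illustrated with a small sketch. It is not the authors' ScanPE implementation, whose details the abstract does not give. It assumes a purely horizontal scan with evenly spaced windows, and the helper names `scan_offsets` and `scan_coordinates` are hypothetical. Each "frame" is treated as a fixed-width window on a wide canvas, and every token column in that window is assigned its position on the full canvas rather than a per-frame local index.

```python
def scan_offsets(canvas_width, frame_width, num_frames):
    """Left edge of each frame's window on the wide canvas.

    Hypothetical helper: assumes evenly spaced windows that together
    sweep the canvas from its left edge to its right edge.
    """
    if num_frames < 1:
        raise ValueError("need at least one frame")
    if frame_width > canvas_width:
        raise ValueError("frame is wider than the canvas")
    if num_frames == 1:
        return [0]
    span = canvas_width - frame_width  # total distance the window travels
    return [round(t * span / (num_frames - 1)) for t in range(num_frames)]


def scan_coordinates(canvas_width, frame_width, num_frames):
    """Global column index of every token column, one list per frame."""
    return [
        [offset + col for col in range(frame_width)]
        for offset in scan_offsets(canvas_width, frame_width, num_frames)
    ]


# A 64-column canvas viewed through five 16-column frames.
coords = scan_coordinates(canvas_width=64, frame_width=16, num_frames=5)
for t, row in enumerate(coords):
    print(f"frame {t}: columns {row[0]}..{row[-1]}")
```

In this toy setting, consecutive frames overlap by a few columns and share the same global indices in the overlap, which is one plausible way for a video model's temporal consistency to act as a spatial constraint across the canvas. How the actual method encodes these coordinates (for example, inside a rotary or learned positional embedding) is not specified in the abstract.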