ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

📅 2026-03-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the structural distortions, such as object duplication and spatial fragmentation, that commonly arise when conventional image diffusion models generate ultra-high-resolution images at extreme aspect ratios (EAR). To ensure long-range structural coherence, the authors reformulate EAR image generation as a continuous video synthesis task, leveraging the temporal consistency inherent in video diffusion models. They introduce Scanning Positional Encoding (ScanPE), which distributes global coordinates across frames to emulate a moving camera, and Scrolling Super-Resolution (ScrollSR), which uses video super-resolution priors to circumvent memory constraints and enable 32K-resolution generation. Starting from a pretrained video diffusion model and fine-tuning on a curated 3K multi-aspect-ratio image dataset, the method is reported to significantly outperform existing image-diffusion baselines, suppressing severe localized artifacts while maintaining global coherence and visual fidelity across diverse domains.

📝 Abstract

While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
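
The idea of mapping a wide canvas onto video frames can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not specify how ScanPE assigns coordinates, so the sketch assumes a uniform left-to-right scan in which every frame is a fixed-width window identified by the global column range it covers. The function names `scan_windows` and `covers_canvas`, and all sizes used, are hypothetical.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (start, end) global column range, end exclusive


def scan_windows(canvas_width: int, frame_width: int, num_frames: int) -> List[Window]:
    """Place num_frames fixed-width windows evenly along a wide canvas.

    Assumption (not from the paper): the "camera" moves left to right
    at constant speed, so window starts are spaced uniformly.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if frame_width > canvas_width:
        raise ValueError("frame_width must not exceed canvas_width")
    if num_frames == 1:
        if frame_width != canvas_width:
            raise ValueError("a single frame cannot cover a wider canvas")
        return [(0, canvas_width)]

    span = canvas_width - frame_width  # total distance the window travels
    starts = [round(i * span / (num_frames - 1)) for i in range(num_frames)]
    return [(s, s + frame_width) for s in starts]


def covers_canvas(windows: List[Window], canvas_width: int) -> bool:
    """Check that consecutive windows leave no uncovered columns."""
    covered = 0
    for start, end in windows:
        if start > covered:  # a gap between the previous window and this one
            return False
        covered = max(covered, end)
    return covered >= canvas_width


if __name__ == "__main__":
    # Hypothetical sizes: a 1024-column canvas scanned by 256-column frames.
    for n in (3, 4, 7):
        wins = scan_windows(1024, 256, n)
        print(n, wins, covers_canvas(wins, 1024))
```

With these hypothetical sizes, four frames tile the canvas exactly, seven frames overlap by half a window, and three frames leave gaps. Because each frame carries its position on the full canvas rather than a local one, neighboring frames can be related to one another, which is the role the abstract assigns to global coordinates.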

Problem

Research questions and friction points this paper is trying to address.

extreme aspect ratio

ultra-high-resolution image generation

structural coherence

spatial priors

diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

ScrollScape

extreme aspect ratio

video diffusion priors