FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models struggle to simultaneously preserve global layout consistency and fine local detail when generating 4K-resolution image-to-video sequences, particularly in complex scenes with multiple characters and semantically rich regions, such as large-scale murals. This work proposes a training-free approach to high-resolution image-to-video generation that leverages a precomputed low-resolution video latent trajectory as a global prior, integrated during patch-based denoising to maintain spatiotemporal coherence. The method introduces a latent-space prior regularization mechanism that combines patch-based diffusion, trajectory upsampling, weighted least-squares fusion, and a spatial regularization variable to jointly optimize global structure and local fidelity; the same variable also enables region-level motion control. Evaluated on VBench-I2V and a newly curated mural dataset, the method significantly outperforms existing patch-based baselines in global consistency and detail preservation while remaining computationally efficient.

📝 Abstract
Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.
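The per-timestep fusion described in the abstract — merging per-tile noise predictions with an upsampled low-resolution reference by minimizing a single weighted least-squares objective in model-output space — admits a simple closed form when the weights act per pixel (diagonal). The sketch below illustrates that closed-form update under this assumption; the function name, arguments, and weighting scheme are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def fuse_tiles_with_prior(tile_preds, tile_slices, tile_weights, prior, lam=0.5):
    """Fuse overlapping per-tile predictions with a global prior.

    Per pixel, minimizes the weighted least-squares objective
        sum_i w_i * (x - eps_i)**2  +  lam * (x - eps_prior)**2,
    whose minimizer is a weighted average of the covering tile
    predictions eps_i and the prior eps_prior. (Hypothetical sketch;
    the paper's exact objective and weights are not reproduced here.)
    """
    num = lam * prior.astype(np.float64)               # lam * eps_prior
    den = np.full(prior.shape, lam, dtype=np.float64)  # lam from the prior term
    for pred, sl, w in zip(tile_preds, tile_slices, tile_weights):
        num[sl] += w * pred                            # accumulate w_i * eps_i
        den[sl] += w                                   # accumulate w_i
    return num / den                                   # closed-form per-pixel solution

# Toy usage: one 2x2 tile prediction over a 4x4 latent, equal weight with the prior.
prior = np.zeros((4, 4))
tile = np.ones((2, 2))
fused = fuse_tiles_with_prior([tile], [(slice(0, 2), slice(0, 2))], [1.0], prior, lam=1.0)
# fused[0, 0] → 0.5 (average of tile and prior); fused[3, 3] → 0.0 (prior only)
```

Raising `lam` pulls every pixel toward the global low-resolution trajectory (stronger coherence), while lowering it lets the tile predictions dominate (more local detail) — the creativity/consistency trade-off the abstract says the regularization exposes.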
Problem

Research questions and friction points this paper is trying to address.

image-to-video
4K resolution
global consistency
tiled diffusion
fresco animation
Innovation

Methods, ideas, or system contributions that make the work stand out.

tiled diffusion
latent prior regularization
4K image-to-video
global coherence
spatial motion control