iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the longstanding challenge of balancing temporal coherence with content diversity in image generation. The authors propose a lightweight adaptation framework that injects static image data into a pre-trained video diffusion model, preserving its motion priors while adding cross-image contextual consistency and a wider generative dynamic range. The approach combines customized dataset construction, minimally invasive architectural modifications, and efficient fine-tuning, enabling, for the first time, the transfer of video diffusion models to general-purpose many-to-many image-set generation. Extensive experiments demonstrate significant improvements over state-of-the-art image and video generation baselines on variable-length image-sequence generation, editing, and other multi-image collaborative tasks. Generated outputs exhibit high fidelity, natural transitions, and strong semantic and dynamic diversity.
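
The summary names a "minimally invasive" adaptation but the listing gives no implementation details. As one concrete illustration, the PyTorch sketch below shows a common parameter-efficient scheme: freezing a pre-trained backbone and training only LoRA-style low-rank adapters, so the original priors are left intact. This is an assumption for illustration, not the paper's published method; `LoRALinear` and the toy backbone are hypothetical.

```python
# Minimal sketch of parameter-efficient adaptation (assumed, not the paper's code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank residual."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # freeze: keep the pre-trained prior intact
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)     # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Hypothetical stand-in for one block of a pre-trained video diffusion model.
backbone = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
for i, layer in enumerate(backbone):
    if isinstance(layer, nn.Linear):
        backbone[i] = LoRALinear(layer)    # only the adapters remain trainable

x = torch.randn(2, 16, 64)                 # (batch, tokens, channels)
print(backbone(x).shape)                   # torch.Size([2, 16, 64])
```

Because the adapter's up-projection is zero-initialized, the wrapped network initially reproduces the frozen backbone exactly, which is one standard way to avoid corrupting existing motion priors at the start of fine-tuning.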

📝 Abstract
Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.
Problem

Research questions and friction points this paper is trying to address.

Generating image sets with natural transitions and expanded dynamics
Repurposing video models for versatile many-to-many image generation
Maintaining motion priors while acquiring broad image manipulation capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposing video model for image generation
Minimally invasive adaptation preserving motion priors
Unified framework for diverse image manipulation tasks