Image-to-Video Diffusion: From Foundations to Open Frontiers

📅 2026-05-17

🤖 AI Summary
Image-to-video (I2V) generation faces core challenges in content consistency, identity preservation, and motion coherence, yet lacks a systematic survey and unified taxonomy. This work establishes diffusion-based I2V generation as a distinct research direction and introduces a dedicated classification framework, identifying four key design components: condition encoding, temporal modeling, noise prior design, and spatiotemporal upsampling. By systematically organizing existing approaches according to model architectures and training paradigms, the paper clarifies task formulations, datasets, evaluation metrics, and methodological advances. It further delineates critical technical pathways, representative application scenarios, and open challenges, thereby promoting standardization and systematic progress in the field.
📝 Abstract
Diffusion-based image-to-video (I2V) generation has become a central direction in generative modeling: it turns a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature is growing rapidly, existing works mostly discuss I2V generation within broader topics, and the field still lacks a dedicated taxonomy and a systematic analysis centered on it. This work addresses that gap by treating diffusion-based I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling, and discusses representative application scenarios together with major open challenges.
Problem

Research questions and friction points this paper is trying to address.

image-to-video
diffusion models
generative models
temporal coherence
content consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

image-to-video
diffusion models
temporal coherence
condition encoding
taxonomy
Xianlong Wang
Ph.D. student, City University of Hong Kong
Trustworthy LLM/VLM, Embodied AI, Unlearnable Example, 3D Point Cloud, Poisoning/Adversarial Attack
Wenbo Pan
Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
Shijia Zhou
School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
Ke Li
School of Cyber Science and Technology, University of Science and Technology of China, Hefei 230026, China
Yuqi Wang
The Hong Kong University of Science and Technology
Oxide Semiconductor, Device Reliability, Process Integration
Zeyu Ye
School of Computer Science, Xiangtan University, Xiangtan 411105, China
Hangtao Zhang
Huazhong University of Science and Technology (HUST)
AI Security
Leo Yu Zhang
School of Information and Communication Technology, Griffith University, Southport, QLD 4215, Australia
Xiaohua Jia
Chinese Academy of Sciences