Versatile Transition Generation with Image-to-Video Diffusion

📅 2025-08-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the generation of transition videos conditioned on the first and last frames and a text prompt, aiming for smooth, high-fidelity, and semantically coherent intermediate frames. The proposed framework, VTG, builds on a pre-trained image-to-video diffusion model and has three key components: (1) interpolation-based initialization, which helps preserve object identity and handle abrupt content changes; (2) dual-directional motion fine-tuning, which targets motion smoothness; and (3) representation alignment regularization, which targets generation fidelity. The authors also introduce TransitBench, a benchmark covering concept blending and scene transition, and report that VTG achieves superior transition performance consistently across the four transition tasks evaluated.

📝 Abstract

Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high-quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive text prompts is far underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization to mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation covering two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across all four tasks.

Problem

Research questions and friction points this paper is trying to address.

Generating smooth, plausible transition videos between a given first and last frame, guided by a text prompt

Preserving object identity during content changes

Overcoming the limited motion smoothness and generation fidelity of pre-trained image-to-video diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpolation-based initialization for identity preservation

Dual-directional motion fine-tuning for smoothness

Representation alignment regularization for fidelity
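
The summary does not spell out how the interpolation-based initialization is computed. As a rough illustration only, the sketch below shows one common way to initialize intermediate frames from two endpoints: spherical interpolation (slerp) between the latents of the first and last frames. The use of slerp, the latent shapes, and the names `slerp` and `init_frames` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherically interpolate between two same-shaped arrays (hypothetical helper)."""
    a, b = z0.ravel(), z1.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    s = np.sin(omega)
    if s < 1e-4:
        # Endpoints are (anti)parallel: fall back to plain linear interpolation.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) / s) * z0 + (np.sin(t * omega) / s) * z1

def init_frames(z_first, z_last, num_frames):
    """Build an initial latent sequence whose ends match the two keyframe latents."""
    ts = np.linspace(0.0, 1.0, num_frames)
    return np.stack([slerp(z_first, z_last, t) for t in ts])

# Example with made-up latent shapes (channels, height, width).
rng = np.random.default_rng(0)
z_first = rng.standard_normal((4, 8, 8))
z_last = rng.standard_normal((4, 8, 8))
frames = init_frames(z_first, z_last, num_frames=5)
print(frames.shape)
```

In a sketch like this, the first and last entries of the sequence equal the endpoint latents, so intermediate frames start from a blend that already leans toward the nearer keyframe. That is one intuitive reason an interpolated initialization could help keep object identity stable across the transition.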

🔎 Similar Papers

Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

2024-08-27 | arXiv.org | Citations: 7
