🤖 AI Summary
Existing video transition methods struggle to generate content-aware, visually coherent transitional frames between clips separated by large temporal gaps or high semantic discrepancies. To address this, we propose a zero-shot, structure-aware video transition method inspired by artistic practice. Its joint sketch-motion alignment mechanism leverages edge-structure maps and optical flow fields together to guide frame synthesis, enabling high-fidelity intermediate-frame generation across heterogeneous clips without fine-tuning. The core innovation is embedding structural priors (edge sketches) and dynamic priors (optical flow) into a diffusion model, establishing a disentangled spatiotemporal constraint framework. Extensive experiments show that the method significantly outperforms state-of-the-art approaches, including FILM, TVG, and DiffMorpher, on quantitative metrics (PSNR, LPIPS, FVD) and user preference rates, with marked gains in visual naturalness and temporal consistency.
📝 Abstract
Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts or break temporal coherence, limiting professional use. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce plausible, high-quality intermediates, but they struggle to bridge diverse clips with large temporal gaps or significant semantic differences, leaving a gap for content-aware, visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on these, we propose SAGE (Structure-Aware Generative vidEo transitions), a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis to produce smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparisons with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and in user studies when producing transitions between diverse clips. Code will be released upon acceptance.