🤖 AI Summary
Existing video transition methods struggle to generate content-aware, visually coherent transitional frames between clips separated by large temporal gaps or high semantic discrepancies. To address this, we propose a zero-shot, structure-aware video transition method inspired by artistic practice. Its joint sketch-motion alignment mechanism leverages edge-structure maps and optical flow fields together to guide frame synthesis, enabling high-fidelity intermediate-frame generation across heterogeneous clips without fine-tuning. The core innovation is embedding structural priors (edge sketches) and dynamic priors (optical flow) into a diffusion model, establishing a disentangled spatiotemporal constraint framework. Extensive experiments show that the method significantly outperforms state-of-the-art approaches, including FILM, TVG, and DiffMorpher, on quantitative metrics (PSNR, LPIPS, FVD) and user preference rates, with marked gains in visual naturalness and temporal consistency.
📝 Abstract
Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts or break temporal coherence, limiting professional use. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce plausible, high-quality intermediates, but they struggle to bridge diverse clips with large temporal gaps or significant semantic differences, leaving a gap for content-aware, visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on these, we propose SAGE (Structure-Aware Generative vidEo transitions), a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis to produce smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparisons with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and in user studies when producing transitions between diverse clips. Code will be released upon acceptance.