🤖 AI Summary
To address unnatural transitions and imprecise style adaptation in video style transfer, this work formulates visual transition modeling as a style-aware temporal generation task, which the authors present as the first such formulation. They propose a multi-granularity style alignment mechanism and explicit temporal consistency constraints, implemented via a Transformer-based cross-modal style encoder, a differentiable transition synthesis module, and an adversarial style discriminator. The method enables end-to-end transition recommendation and synthesis from a source video to a target style, such as documentary, narrative film, or the aesthetic of a specific YouTube channel. Evaluated on a professional video dataset, it improves transition-type recommendation accuracy by 23.6% and attains a user preference score of 4.72/5.0, significantly outperforming state-of-the-art methods. The core contribution is a joint style-temporal modeling paradigm that ensures semantic coherence and visual smoothness simultaneously.
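The summary names three components but not their interfaces; below is a minimal PyTorch sketch of how they could fit together. Everything here is an illustrative assumption rather than the paper's actual implementation: the class names (`StyleEncoder`, `TransitionSynthesizer`, `StyleDiscriminator`), the cross-fade formulation of transition synthesis, and all dimensions are hypothetical.

```python
# Hypothetical sketch of the described architecture; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleEncoder(nn.Module):
    """Transformer over per-frame features; a learned style token pools them."""

    def __init__(self, feat_dim=512, n_heads=8, n_layers=4):
        super().__init__()
        self.style_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats):  # (B, T, D) per-frame features
        b = frame_feats.size(0)
        tok = self.style_token.expand(b, -1, -1)
        out = self.encoder(torch.cat([tok, frame_feats], dim=1))
        return out[:, 0]  # (B, D) style code


class TransitionSynthesizer(nn.Module):
    """Predicts a style-conditioned per-frame blend curve, then cross-fades
    the two shots: a differentiable stand-in for transition synthesis."""

    def __init__(self, feat_dim=512, n_frames=16):
        super().__init__()
        self.alpha_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_frames)
        )

    def forward(self, clip_a, clip_b, style_code):
        # clip_a, clip_b: (B, T, C, H, W) overlapping frames of the two shots
        alphas = torch.sigmoid(self.alpha_head(style_code))  # (B, T) in [0, 1]
        a = alphas.view(*alphas.shape, 1, 1, 1)
        return (1 - a) * clip_a + a * clip_b  # blended transition frames


class StyleDiscriminator(nn.Module):
    """Scores whether an encoded transition matches the target style code."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * 2, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, transition_code, style_code):
        return self.mlp(torch.cat([transition_code, style_code], dim=-1))


def temporal_consistency_loss(frames):
    """Penalizes abrupt frame-to-frame changes (explicit smoothness constraint)."""
    return F.mse_loss(frames[:, 1:], frames[:, :-1])


if __name__ == "__main__":
    B, T, D, C, H, W = 2, 16, 512, 3, 64, 64
    enc = StyleEncoder(D)
    syn = TransitionSynthesizer(D, n_frames=T)
    disc = StyleDiscriminator(D)

    ref_feats = torch.randn(B, T, D)        # frame features of a style reference
    style = enc(ref_feats)                  # target style code
    clip_a, clip_b = torch.randn(2, B, T, C, H, W)
    transition = syn(clip_a, clip_b, style)              # (B, T, C, H, W)
    smooth = temporal_consistency_loss(transition)
    # In training, the discriminator would score encoded transitions against
    # the style code (frame-feature extraction omitted here for brevity).
    score = disc(enc(torch.randn(B, T, D)), style)
```

In this sketch the sigmoid blend curve reduces synthesis to a learned, style-conditioned cross-fade; the paper's actual module presumably generates richer effects (wipes, warps, generated frames), and the discriminator would be trained adversarially against the synthesizer rather than used purely as a scorer.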