🤖 AI Summary
Existing video style transfer methods are hindered by the scarcity of large-scale triplet data (style reference, clean video, stylized video) and by the lack of an effective modeling paradigm, which leads to temporal inconsistency, fragile handling of occlusions, and flickering artifacts. To address these limitations, this work introduces VISTA-1000, the first large-scale synthetic dataset with aligned style, content, and motion, encompassing 1,000 distinct artistic styles. Building on this dataset, the authors propose an in-context video style transfer framework based on a diffusion transformer, augmented with a lightweight style adapter for robust style extraction. By jointly modeling and disentangling style, content, and motion, the approach outperforms existing methods in style fidelity, temporal consistency, and content preservation, and suppresses flickering and drift artifacts.
📝 Abstract
Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and of a principled training paradigm that jointly models and disentangles style, content, and motion. To address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video. We also propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments demonstrate state-of-the-art performance in style fidelity, temporal consistency, and content preservation.