🤖 AI Summary
Video stylization faces the dual challenges of temporal inconsistency and limited stylistic expressiveness: existing approaches either induce flickering through frame-wise processing or require paired video data and costly training. This paper proposes FreeViS, a training-free video stylization framework built on a pretrained image-to-video diffusion model. It introduces three key components: optical-flow-guided motion modeling, multi-reference image feature fusion, and high-frequency content compensation. According to the authors, this is the first method to achieve high-fidelity style transfer while preserving style textures, entirely without fine-tuning or training. Quantitative and qualitative evaluations show that the approach outperforms state-of-the-art methods in style fidelity, inter-frame consistency, and visual naturalness, and user studies confirm higher perceptual preference. By eliminating the need for training data and optimization, the framework establishes a practical paradigm for high-quality, low-cost video stylization.
📝 Abstract
Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works without introducing flickering or stuttering. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economical solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/
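To make the "high-frequency compensation" idea concrete: one common way to isolate high-frequency content is to subtract a low-pass (Gaussian-blurred) copy of a frame from the frame itself, leaving edges and fine texture that can then be re-injected as a layout/motion constraint. The sketch below illustrates only this generic residual extraction; the function name `high_frequency` and the Gaussian low-pass choice are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def high_frequency(frame: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Return the high-frequency residual of an (H, W, C) frame.

    Illustrative sketch: blur spatially (not across channels) to get a
    low-pass copy, then subtract it. What remains is edge/texture detail,
    the kind of signal a compensation step could re-inject to keep the
    stylized video's layout and motion anchored to the input.
    """
    frame = frame.astype(np.float64)
    low = gaussian_filter(frame, sigma=(sigma, sigma, 0))  # spatial-only blur
    return frame - low

# A perfectly flat region carries no high-frequency content,
# so its residual is (numerically) zero everywhere.
flat = np.full((16, 16, 3), 0.5)
print(float(np.abs(high_frequency(flat)).max()) < 1e-6)  # → True
```

A sharp edge, by contrast, survives the subtraction: the residual is largest exactly where intensity changes quickly, which is what makes it useful as a content-layout constraint.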