🤖 AI Summary
Video stylization faces the dual challenges of temporal inconsistency and limited stylistic expressiveness: existing approaches either induce flickering through frame-wise processing or require paired video data and costly training. This paper proposes FreeViS, a training-free video stylization framework built on a pretrained image-to-video diffusion model. It introduces three key components: optical-flow-guided motion modeling, multi-reference image feature fusion, and high-frequency content compensation. According to the authors, this is the first method to achieve high-fidelity style transfer while preserving style textures, entirely without fine-tuning or training. Quantitative and qualitative evaluations show that the approach outperforms state-of-the-art methods in style fidelity, inter-frame consistency, and visual naturalness, and user studies confirm higher perceptual preference. By eliminating the need for training data and optimization, the framework establishes a practical paradigm for high-quality, low-cost video stylization.
📝 Abstract
Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works without introducing flickering or stuttering. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economical solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/
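To make the "high-frequency compensation" idea concrete: one common way to isolate high-frequency content is to subtract a low-pass (Gaussian-blurred) copy of a frame from the frame itself, leaving edges and fine texture that can then be re-injected as a layout/motion constraint. The sketch below illustrates only this generic residual extraction; the function name `high_frequency` and the Gaussian low-pass choice are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def high_frequency(frame: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Return the high-frequency residual of an (H, W, C) frame.

    Illustrative sketch: blur spatially (not across channels) to get a
    low-pass copy, then subtract it. What remains is edge/texture detail,
    the kind of signal a compensation step could re-inject to keep the
    stylized video's layout and motion anchored to the input.
    """
    frame = frame.astype(np.float64)
    low = gaussian_filter(frame, sigma=(sigma, sigma, 0))  # spatial-only blur
    return frame - low

# A perfectly flat region carries no high-frequency content,
# so its residual is (numerically) zero everywhere.
flat = np.full((16, 16, 3), 0.5)
print(float(np.abs(high_frequency(flat)).max()) < 1e-6)  # → True
```

A sharp edge, by contrast, survives the subtraction: the residual is largest exactly where intensity changes quickly, which is what makes it useful as a content-layout constraint.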