PickStyle: Video-to-Video Style Transfer with Context-Style Adapters

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of paired video supervision in video style transfer, this paper proposes a diffusion-based approach trained only with image-level supervision, requiring no video-level paired data. The method introduces two key innovations: (1) Context-Style Classifier-Free Guidance (CS-CFG), a factorization of classifier-free guidance that explicitly disentangles content and style control; and (2) a lightweight low-rank Style Adapter embedded in the attention layers of the diffusion model for efficient style injection. By modeling motion priors through synthesized dynamic image sequences, the approach carries static image supervision over to the video domain while preserving temporal coherence. Experiments on multiple benchmarks demonstrate superior performance over state-of-the-art methods in content fidelity, style faithfulness, and motion consistency, with both qualitative and quantitative evaluations confirming the effectiveness and robustness of the framework.
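The low-rank Style Adapter described above can be sketched as a LoRA-style residual on a frozen attention projection. This is an illustrative reconstruction, not the paper's implementation; the class name, rank, and scale are assumptions.

```python
import numpy as np

class StyleAdapter:
    """Hypothetical low-rank adapter: y = x W^T + scale * (x A^T) B^T.

    W is the frozen base projection; only the low-rank factors A and B
    would be trained. B is zero-initialized so the adapter starts as an
    identity on top of the frozen layer (standard LoRA practice).
    """
    def __init__(self, d_in, d_out, rank=4, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01            # trainable down-projection
        self.B = np.zeros((d_out, rank))                             # trainable up-projection
        self.scale = scale

    def __call__(self, x):
        # x: (..., d_in) -> (..., d_out)
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

adapter = StyleAdapter(d_in=8, d_out=8, rank=2)
y = adapter(np.ones((2, 8)))  # (2, 8); equals the frozen layer's output until B is trained
```

Because only `A` and `B` receive gradients, the number of trainable parameters grows with the rank rather than with the full projection size, which is what makes such adapters a lightweight way to specialize a pretrained backbone.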

📝 Abstract
We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
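The abstract's synthetic-clip construction, turning paired stills into paired clips via shared augmentations that simulate camera motion, can be sketched as a crop window panning identically across both images. The linear pan, crop size, and function names below are illustrative assumptions.

```python
import numpy as np

def pan_clip(image, num_frames=8, crop=32):
    """Simulate camera motion over a still image.

    image: (H, W, C) array. Returns a (num_frames, crop, crop, C) clip
    produced by sliding a fixed-size crop window left to right.
    """
    h, w, _ = image.shape
    xs = np.linspace(0, w - crop, num_frames).astype(int)  # pan trajectory
    y = (h - crop) // 2                                    # vertically centered
    return np.stack([image[y:y + crop, x:x + crop] for x in xs])

def paired_clips(source_img, style_img, num_frames=8, crop=32):
    """Apply the SAME crop trajectory to both images of a pair, so the
    source clip and style clip share identical simulated motion."""
    return (pan_clip(source_img, num_frames, crop),
            pan_clip(style_img, num_frames, crop))
```

The key property is that the augmentation is shared: because both clips follow the same trajectory, frame t of the source corresponds spatially to frame t of the style target, giving the model a temporally aligned supervision signal despite starting from static images.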
Problem

Research questions and friction points this paper is trying to address.

Preserving video content while transferring text-specified artistic styles
Overcoming lack of paired video data for style transfer supervision
Maintaining temporal coherence when applying image-trained models to videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank adapters enable efficient motion-style transfer
Synthetic video clips bridge static image supervision gap
Context-Style Classifier-Free Guidance preserves content while transferring style
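The CS-CFG idea named in the last bullet, factorizing classifier-free guidance into independent context (video) and style (text) directions, can be sketched with three denoiser predictions. The combination rule and guidance weights below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cs_cfg(eps_uncond, eps_context, eps_full, w_context=1.5, w_style=7.5):
    """Hypothetical factorized guidance.

    eps_uncond : prediction with no conditioning
    eps_context: prediction conditioned on the input video only
    eps_full   : prediction conditioned on video + style text prompt

    The context direction steers generation toward the input video's
    content; the style direction steers it toward the text-specified
    style. Separate weights let the two be balanced independently.
    """
    context_dir = eps_context - eps_uncond
    style_dir = eps_full - eps_context
    return eps_uncond + w_context * context_dir + w_style * style_dir
```

With both weights set to 1 this reduces to the fully conditioned prediction; raising `w_context` alone strengthens content preservation, while raising `w_style` alone strengthens style transfer, which is the disentanglement the bullet describes.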
Soroush Mehraban
Pickford AI, University of Toronto, Vector Institute
Vida Adeli
Pickford AI, University of Toronto, Vector Institute
Jacob Rommann
Pickford AI
Babak Taati
KITE Research Institute | Toronto Rehab - UHN & Department of Computer Science, University of Toronto
Computer Vision · Health Monitoring · Ambient Intelligence
Kyryl Truskovskyi
Pickford AI