DreamStyle: A Unified Framework for Video Stylization

📅 2026-01-06

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work proposes DreamStyle, a unified framework for video stylization that supports three kinds of guidance: text prompts, style images, and a stylized first frame. Existing methods typically accept only one conditioning modality, and the lack of high-quality paired training data leaves them prone to style inconsistency and temporal flicker. DreamStyle addresses both problems with a data curation pipeline that yields high-quality paired video data and with LoRA fine-tuning that uses token-specific up-projection matrices to reduce confusion among the different condition tokens. Because it is built on a vanilla image-to-video (I2V) model and trained with LoRA, the underlying architecture is preserved. Qualitative and quantitative evaluations show that DreamStyle handles all three stylization tasks and outperforms competing methods in style consistency and video quality.

📝 Abstract

Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its style conditions typically take one of three forms: text, a style image, or a stylized first frame. Each has a characteristic advantage: text is more flexible, a style image provides a more accurate visual anchor, and a stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, the lack of high-quality datasets leads to style inconsistency and temporal flicker. To address these limitations, we introduce DreamStyle, a unified framework for video stylization that supports (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a well-designed data curation pipeline for acquiring high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained using Low-Rank Adaptation (LoRA) with token-specific up matrices that reduce the confusion among different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle is competent in all three video stylization tasks and outperforms competitors in style consistency and video quality.
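
The abstract names "LoRA with token-specific up matrices" without giving a formulation, so the following is only a minimal NumPy sketch of one plausible reading: a frozen linear layer, a single shared down-projection, and a separate up-projection chosen per token group. The function name `token_specific_lora`, the shapes, the `scale` parameter, and the three-way token grouping (for example video, style-image, and first-frame tokens) are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def token_specific_lora(x, token_types, W, A, B_per_type, scale=1.0):
    """Frozen linear layer plus a LoRA update whose up matrix depends on token group.

    x:           (n_tokens, d_in)     input tokens
    token_types: (n_tokens,)          integer group id per token (hypothetical grouping)
    W:           (d_out, d_in)        frozen base weight
    A:           (r, d_in)            shared down-projection
    B_per_type:  (n_types, d_out, r)  one up-projection per token group
    """
    base = x @ W.T                        # frozen path, (n_tokens, d_out)
    low = x @ A.T                         # shared low-rank code, (n_tokens, r)
    B = B_per_type[token_types]           # each token's own up matrix, (n_tokens, d_out, r)
    delta = np.einsum("nor,nr->no", B, low)  # per-token up-projection
    return base + scale * delta


rng = np.random.default_rng(0)
d_in, d_out, r, n_types = 8, 6, 2, 3
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B_per_type = rng.normal(size=(n_types, d_out, r))

# The same input vector repeated three times, tagged with three different groups
# (assumed here: 0 = video tokens, 1 = style-image tokens, 2 = first-frame tokens).
x = np.tile(rng.normal(size=(1, d_in)), (3, 1))
token_types = np.array([0, 1, 2])

y = token_specific_lora(x, token_types, W, A, B_per_type)

# With all up matrices set to zero the layer reduces to the frozen base model.
y_zero = token_specific_lora(x, token_types, W, A, np.zeros_like(B_per_type))
```

In this sketch, identical inputs tagged with different groups receive different updates, which is one way a model could keep condition tokens from being confused with one another while leaving the base weight `W` untouched.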

Problem

Research questions and friction points this paper is trying to address.

video stylization

style consistency

temporal flicker

multi-condition guidance

high-quality dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified video stylization

Low-Rank Adaptation (LoRA)

multi-condition guidance

style consistency

high-quality video dataset
