IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video special effects editing faces three core challenges: seamless integration of effects with the background, strict background fidelity, and efficient modeling of effect patterns from sparse paired data, objectives rarely satisfied simultaneously by existing methods. This paper proposes an instruction-driven few-shot video special-effects editing framework capable of injecting complex effects (including fire, particles, and cartoon characters) while preserving background invariance and spatiotemporal consistency. Key contributions include: (1) a novel DiT architecture conditioned on the source video as a clean contextual prior; (2) Effect-LoRA, a two-stage fine-tuning strategy for disentangled effect learning; and (3) a spatiotemporally sparse tokenization mechanism that significantly improves computational efficiency. Evaluated on a newly constructed 15-style paired VFX dataset, the method outperforms state-of-the-art approaches in fidelity, controllability, and temporal consistency. The benchmark dataset is publicly released.

📝 Abstract
We propose **IC-Effect**, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (e.g., flames, particles, and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as a clean contextual condition, exploiting the in-context learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning 15 high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
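To give a rough intuition for how spatiotemporal sparse tokenization reduces the cost of the context stream, the sketch below patchifies a source clip and keeps only a strided subset of its tokens. All shapes, strides, and function names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sparse_context_tokens(video, t_stride=2, s_stride=2, patch=4):
    """Hypothetical sketch of spatiotemporally sparse tokenization.

    video: (T, H, W, C) source clip used as a clean contextual condition.
    Frames are split into non-overlapping patches (tokens); the context
    stream then keeps every t_stride-th frame and every s_stride-th patch
    along each spatial axis, shrinking the token count by roughly
    t_stride * s_stride**2.
    """
    T, H, W, C = video.shape
    # patchify: (T, H//patch, W//patch, patch*patch*C)
    tokens = video.reshape(T, H // patch, patch, W // patch, patch, C)
    tokens = tokens.transpose(0, 1, 3, 2, 4, 5).reshape(
        T, H // patch, W // patch, patch * patch * C)
    # strided subsampling in time and both spatial axes
    sparse = tokens[::t_stride, ::s_stride, ::s_stride]
    return sparse.reshape(-1, patch * patch * C)  # flat token sequence

clip = np.zeros((8, 16, 16, 3), dtype=np.float32)
dense_count = 8 * (16 // 4) * (16 // 4)   # 128 tokens if kept dense
ctx = sparse_context_tokens(clip)          # 16 tokens after sparsification
```

With these default strides the context sequence shrinks by a factor of eight, which is where the claimed efficiency gain would come from; the actual mechanism in the paper may differ.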
Problem

Research questions and friction points this paper is trying to address.

Enables few-shot video VFX editing with complex effects
Preserves spatial and temporal consistency in edited videos
Learns effect patterns efficiently from limited paired data
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiT-based framework for few-shot video VFX editing
Two-stage training with Effect-LoRA for robust modeling
Spatiotemporal sparse tokenization reduces computation efficiently
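The Effect-LoRA contribution above follows the general LoRA recipe: freeze the base editing weights and train a small low-rank adapter per effect style. A minimal numpy sketch of that idea (class name, shapes, and initialization are generic illustrations, not the paper's code):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A,
    one adapter per effect style in the Effect-LoRA setting."""

    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base
        self.A = rng.standard_normal((rank, d_in)) * 0.02   # trainable down-proj
        self.B = np.zeros((d_out, rank))                    # trainable, zero-init

    def __call__(self, x):
        # base path + low-rank effect-specific path
        return x @ self.W.T + x @ self.A.T @ self.B.T
```

Because `B` is zero-initialized, training starts exactly from the base model's behavior, and a trained adapter can later be merged into the base as `W + B @ A`, so swapping effect styles only swaps a small pair of matrices.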
Authors
Yuanhang Li
School of Information and Communication Engineering, Communication University of China
Yiren Song
Ph.D. student, National University of Singapore
Junzhe Bai
School of Information and Communication Engineering, Communication University of China
Xinran Liang
School of Information and Communication Engineering, Communication University of China
Hu Yang
Baidu Inc., Beijing, China
Libiao Jin
School of Information and Communication Engineering, Communication University of China
Qi Mao
School of Information and Communication Engineering, Communication University of China