🤖 AI Summary
Existing VFX generation methods rely on a "one-effect-one-LoRA" paradigm, which consumes substantial computational resources and generalizes poorly. This paper introduces the first in-context-learning framework for dynamic visual effects generation, formulating effect transfer as a reference-video-guided, context-conditioned generation task. The key contributions are: (1) an in-context attention masking mechanism that enables multi-effect disentanglement and leakage-free conditional injection within a single diffusion model; and (2) a one-shot adaptation capability that allows rapid generalization to unseen effect categories without retraining. Experiments demonstrate high-fidelity reproduction across diverse dynamic effects, including motion blur, lens flare, and particle simulations, as well as significant gains over baselines on out-of-domain effects. All code, pretrained models, and datasets are publicly released.
📝 Abstract
Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods rely on a one-LoRA-per-effect paradigm that is resource-intensive and fundamentally incapable of generalizing to unseen effects, limiting both scalability and creativity. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling the model to reproduce diverse dynamic effects from a reference video onto target content, and it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example, together with an in-context attention mask that precisely decouples and injects the essential effect attributes, allowing a single unified model to master effect imitation without information leakage. We further propose an efficient one-shot adaptation mechanism that rapidly boosts generalization to challenging unseen effects from a single user-provided video. Extensive experiments demonstrate that our method faithfully imitates diverse categories of effects and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.
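The abstract does not spell out the exact form of the in-context attention mask, but the stated goal (condition on a reference example while preventing information leakage into it) is commonly realized with an asymmetric block mask over the concatenated token sequence. The sketch below is a hypothetical illustration of that idea, not the paper's actual implementation: target tokens may attend to reference tokens to absorb the effect, while reference tokens attend only among themselves.

```python
import numpy as np

def build_incontext_mask(n_ref: int, n_tgt: int) -> np.ndarray:
    """Hypothetical asymmetric attention mask for a concatenated
    [reference | target] token sequence.

    True = attention allowed.
    - Reference tokens attend only to reference tokens, so the
      reference branch is never contaminated by target content.
    - Target tokens attend to both blocks, so effect information
      flows one way: from the reference example into the target.
    """
    n = n_ref + n_tgt
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_ref, :n_ref] = True   # ref -> ref
    mask[n_ref:, :] = True        # tgt -> ref and tgt -> tgt
    return mask

# Toy example: 3 reference tokens, 2 target tokens.
mask = build_incontext_mask(3, 2)
```

In a diffusion transformer this boolean matrix would be converted to an additive bias (0 where True, -inf where False) and applied inside every self-attention layer; how the real model partitions effect attributes from content is a design detail the paper itself would have to specify.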