Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing open-source video editing datasets struggle to support high-quality, temporally consistent instruction-guided background replacement, often resulting in static and unnatural backgrounds. This work attributes the problem to imprecise background guidance during data synthesis and proposes a scalable pipeline that generates foreground and background guidance in a decoupled manner and applies strict quality filtering. The pipeline yields Sparkle, a dataset of approximately 140,000 video pairs spanning five common background-change themes, along with Sparkle-Bench, a dedicated evaluation benchmark for background replacement. The model trained on Sparkle performs substantially better than existing baselines on both OpenVE-Bench and Sparkle-Bench.

📝 Abstract

In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.
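The abstract names static backgrounds as the main failure mode and mentions strict quality filtering, but it does not describe the filter itself. The following is only a minimal sketch of one way such a check could look, not the authors' method: it measures how much the background region changes between consecutive frames and flags clips whose background barely moves. The function names, the list-of-lists frame format, the foreground masks, and the threshold value are all assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical data layout (an assumption, not from the paper):
Frame = List[List[float]]  # grayscale frame, rows of intensities in [0, 1]
Mask = List[List[bool]]    # True where the pixel belongs to the foreground


def background_motion(frames: Sequence[Frame], masks: Sequence[Mask]) -> float:
    """Mean absolute intensity change between consecutive frames,
    computed only over pixels that are background in both frames."""
    total, count = 0.0, 0
    for t in range(1, len(frames)):
        prev, curr = frames[t - 1], frames[t]
        m_prev, m_curr = masks[t - 1], masks[t]
        for y in range(len(curr)):
            for x in range(len(curr[y])):
                if m_prev[y][x] or m_curr[y][x]:
                    continue  # ignore foreground pixels in either frame
                total += abs(curr[y][x] - prev[y][x])
                count += 1
    # With no background pixels or fewer than two frames, report zero motion.
    return total / count if count else 0.0


def is_static_background(
    frames: Sequence[Frame], masks: Sequence[Mask], threshold: float = 0.01
) -> bool:
    """Flag a clip whose background motion falls below an assumed threshold."""
    return background_motion(frames, masks) < threshold


if __name__ == "__main__":
    mask = [[True, False], [False, False]]  # top-left pixel is foreground
    static_clip = [
        [[0.9, 0.25], [0.25, 0.25]],
        [[0.1, 0.25], [0.25, 0.25]],  # only the foreground pixel changes
    ]
    lively_clip = [
        [[0.9, 0.25], [0.25, 0.25]],
        [[0.1, 0.75], [0.75, 0.75]],  # background pixels change too
    ]
    print(is_static_background(static_clip, [mask, mask]))
    print(is_static_background(lively_clip, [mask, mask]))
```

A real pipeline would presumably operate on decoded video tensors and segmentation masks, and would combine several signals rather than a single threshold; this sketch only illustrates the idea of scoring background liveliness separately from the foreground.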

Problem

Research questions and friction points this paper is trying to address.

Background Replacement

Video Editing

Instruction-Guided Generation

Temporal Consistency

Training Data Scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Guidance

Video Background Replacement

Instruction-Guided Editing