AI Summary
This work addresses a fundamental challenge in video and 3D scene editing: reconciling *temporal and geometric consistency* with *large-scale geometric modifications*. To this end, we propose a training-free, instruction-driven editing framework. It combines progressive subtask decomposition with a render-edit-reconstruct pipeline and introduces a triple-control mechanism: (i) initial-noise anchoring to preserve structural priors, (ii) stepwise noise modulation for controlled geometric evolution, and (iii) cross-attention guidance between text and video features to enforce semantic alignment. This is the first approach to enable 3D, spatiotemporally consistent editing under *geometrically significant transformations*. Evaluated on multiple video editing and complex 3D scene benchmarks, our method achieves state-of-the-art performance, delivering high-fidelity, geometrically plausible, and spatiotemporally coherent edits without fine-tuning or domain-specific training.
Abstract
This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. To address the critical challenge of balancing preservation of the original content against fulfillment of the editing task, our approach employs a progressive strategy that decomposes a complex editing task into a sequence of simpler subtasks. Each subtask is controlled through three synergistic mechanisms: the initial noise, the noise added at each denoising step, and the cross-attention maps between text prompts and video content. Together, these ensure robust preservation of original video elements while effectively applying the desired edits. Beyond its native video editing capability, we extend V$^2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that V$^2$Edit produces high-quality, successful edits across a variety of challenging video editing tasks and complex 3D scene editing tasks, establishing state-of-the-art performance in both domains.