🤖 AI Summary
This paper addresses the challenges of temporal incoherence and cross-view inconsistency in instruction-driven editing of 4D scenes (spatiotemporal + multi-view). We propose PSF-4D, a progressive sampling framework that requires no external models. Its core contributions are: (i) the first introduction of correlated Gaussian noise modeling to explicitly enforce inter-frame temporal consistency; (ii) a novel cross-view shared-independent noise decomposition mechanism, coupled with view-aware iterative refinement, that enables joint optimization of temporal, spatial, and view consistency within a unified diffusion framework; and (iii) support for diverse editing tasks, including style transfer, multi-attribute editing, object removal, and local editing. Extensive experiments demonstrate that PSF-4D consistently outperforms state-of-the-art methods in editing fidelity, temporal coherence, and multi-view consistency.
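The summary does not give the exact form of the correlated noise; the sketch below shows one standard way such inter-frame correlation can be realized, an AR(1)-style recursion in which `alpha` is a hypothetical correlation coefficient, not a parameter named in the paper.

```python
import torch

def correlated_temporal_noise(num_frames, shape, alpha=0.9, generator=None):
    """AR(1)-style correlated Gaussian noise across frames (illustrative).

    eps_t = alpha * eps_{t-1} + sqrt(1 - alpha^2) * z_t, with z_t ~ N(0, I),
    so every frame stays unit-variance while consecutive frames have
    correlation alpha. alpha=0 gives independent per-frame noise;
    alpha -> 1 gives identical noise in every frame.
    """
    noise = torch.randn(num_frames, *shape, generator=generator)
    for t in range(1, num_frames):
        noise[t] = alpha * noise[t - 1] + (1.0 - alpha**2) ** 0.5 * noise[t]
    return noise  # (num_frames, *shape); candidate initial noise per frame
```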
📝 Abstract
Instruction-guided generative models, especially those built on text-to-image (T2I) and text-to-video (T2V) diffusion frameworks, have advanced content editing in recent years. To extend these capabilities to 4D scenes, we introduce a progressive sampling framework for 4D editing (PSF-4D) that ensures temporal and multi-view consistency by controlling the noise initialization during forward diffusion. For temporal coherence, we design a correlated Gaussian noise structure that links frames over time, allowing each frame to depend meaningfully on prior frames. To ensure spatial consistency across views, we employ a cross-view noise model that uses shared and independent noise components to balance commonalities and distinct details among different views. To further enhance spatial coherence, PSF-4D incorporates view-consistent iterative refinement, embedding view-aware information into the denoising process so that edits remain aligned across frames and views. Our approach enables high-quality 4D editing without relying on external models, addressing key limitations of previous methods. Extensive evaluation on multiple benchmarks and editing tasks (e.g., style transfer, multi-attribute editing, object removal, and local editing) shows that PSF-4D outperforms state-of-the-art 4D editing methods.
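The cross-view shared/independent decomposition can likewise be sketched as a variance-preserving mix of one noise tensor common to all views and a per-view tensor; `shared_ratio` below is a hypothetical weighting, not a parameter from the paper.

```python
import torch

def cross_view_noise(num_views, shape, shared_ratio=0.5, generator=None):
    """Shared + independent Gaussian noise across views (illustrative).

    eps_v = sqrt(r) * eps_shared + sqrt(1 - r) * eps_v_ind keeps each
    view's noise unit-variance while giving any two views correlation r,
    balancing cross-view commonality against view-specific detail.
    """
    shared = torch.randn(*shape, generator=generator)            # common to all views
    indep = torch.randn(num_views, *shape, generator=generator)  # independent per view
    r = shared_ratio
    return r**0.5 * shared.unsqueeze(0) + (1.0 - r) ** 0.5 * indep
```

In a full pipeline the two constructions would presumably be composed, so that each (frame, view) latent draws its initial noise from both the temporal recursion and the view decomposition.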