🤖 AI Summary
Existing 4D dynamic scene editing methods require retraining on thousands of 2D images, incurring prohibitive computational cost and poor scalability across time steps. This paper proposes a static-dynamic decoupled editing framework based on 4D Gaussian representations: the scene is decomposed into static 3D Gaussians and a Hexplane-parameterized deformation field, with editing applied exclusively to the static component. To correct the misalignment in the deformation field induced by static edits, the authors introduce a score distillation fine-tuning mechanism. As the first decoupling paradigm for 4D scene editing, the approach drastically reduces computational overhead—cutting editing time by over 50%—while enabling high-fidelity, instruction-driven dynamic scene editing. It scales well with the number of timesteps and improves user controllability and editing consistency across time.
📝 Abstract
Recent 4D dynamic scene editing methods require editing thousands of 2D images used for dynamic scene synthesis and updating the entire scene with additional training loops, resulting in several hours of processing to edit a single dynamic scene. Consequently, these methods do not scale with the temporal dimension of the dynamic scene (i.e., the number of timesteps). In this work, we propose an efficient dynamic scene editing method that is more scalable in the temporal dimension. To achieve computational efficiency, we leverage a 4D Gaussian representation that models a 4D dynamic scene by combining static 3D Gaussians with a Hexplane-based deformation field, which handles the dynamic information. We then perform editing solely on the static 3D Gaussians, the minimal but sufficient component required for visual editing. To resolve the misalignment between the edited 3D Gaussians and the deformation field that can result from the editing process, we additionally conduct a refinement stage using a score distillation mechanism. Extensive editing results demonstrate that our method is efficient, reducing editing time by more than half compared to existing methods, while achieving high editing quality that better follows user instructions.
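The decoupling idea in the abstract can be illustrated with a toy sketch. All names and operations below (`deform`, `edit_static`, the sinusoidal motion, the color shift) are illustrative stand-ins assumed for demonstration, not the paper's implementation: the point is only that the edit is applied once to the static Gaussians, while the shared deformation field propagates it to every timestep, so cost does not grow with the number of frames.

```python
import numpy as np

rng = np.random.default_rng(0)

# Static component: stand-ins for 3D Gaussian centers and colors
# (the real representation also carries covariances, opacities, etc.).
centers = rng.normal(size=(100, 3))
colors = rng.uniform(size=(100, 3))

def deform(centers, t):
    """Toy deformation field: displaces the static centers as a function
    of time t. The paper uses a Hexplane-parameterized field instead."""
    return centers + 0.1 * np.sin(t) * np.ones_like(centers)

def edit_static(colors):
    """Toy instruction-driven edit applied only to the static component
    (a simple color shift standing in for a diffusion-guided edit)."""
    return np.clip(colors * np.array([1.2, 0.8, 0.8]), 0.0, 1.0)

# The edit touches only the static Gaussians, exactly once...
edited_colors = edit_static(colors)

# ...yet every rendered timestep inherits it through the shared
# deformation field, so editing cost is independent of timestep count.
frames = [(deform(centers, t), edited_colors) for t in (0.0, 0.5, 1.0)]
```

In this sketch, the refinement stage from the abstract would then fine-tune `deform` (via score distillation) so the warped, edited Gaussians remain consistent at every timestep.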