🤖 AI Summary
This work addresses fine-grained, part-level, high-fidelity editing of 3D scenes that preserves structural integrity. The authors propose TRACE, a mesh-guided 3D Gaussian Splatting (3DGS) editing framework that anchors video diffusion models to explicit 3D geometry to enable automated, high-fidelity manipulation. Key contributions include MV-TRACE, described as the first multi-view consistent dataset dedicated to scene-coherent object addition and modification, along with two mechanisms: Tangible Geometry Anchoring (TGA) and Contextual Video Masking (CVM). The method is a three-stage pipeline: multi-view 3D-anchor synthesis, two-phase registration of inserted meshes to the 3DGS scene, and autoregressive video generation guided by 3D projections. Experiments show that the approach consistently outperforms existing methods, especially in editing versatility and structural integrity, producing temporally stable, physically grounded 3D scene edits.
📝 Abstract
We present TRACE, a mesh-guided 3DGS editing framework that achieves automated, high-fidelity scene transformation. By anchoring video diffusion with explicit 3D geometry, TRACE uniquely enables fine-grained, part-level manipulation, such as local pose shifting or component replacement, while preserving the structural integrity of the central subject, a capability largely absent in existing editing methods. Our approach comprises three key stages: (1) Multi-view 3D-Anchor Synthesis, which leverages a sparse-view editor trained on our MV-TRACE dataset (the first multi-view consistent dataset dedicated to scene-coherent object addition and modification) to generate spatially consistent 3D-anchors; (2) Tangible Geometry Anchoring (TGA), which ensures precise spatial synchronization between inserted meshes and the 3DGS scene via two-phase registration; and (3) Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline to achieve temporally stable, physically grounded rendering. Extensive experiments demonstrate that TRACE consistently outperforms existing methods, especially in editing versatility and structural integrity.
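The abstract does not say how CVM turns 3D projections into masks, so the following is only an illustrative sketch of the general idea: project the vertices of an inserted mesh into a camera view and mark the covered image region. The pinhole camera model, the bounding-box mask, and the names `project_point` and `bbox_mask` are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def project_point(p: Vec3, focal: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-space point; None if it lies behind the camera."""
    x, y, z = p
    if z <= 0:
        return None
    return (focal * x / z + cx, focal * y / z + cy)


def bbox_mask(vertices: List[Vec3], focal: float, width: int, height: int) -> List[List[int]]:
    """Binary mask covering the 2D bounding box of the projected vertices.

    Hypothetical stand-in for a per-frame edit mask: a real system would
    rasterize the mesh triangles rather than take a bounding box.
    """
    cx, cy = width / 2.0, height / 2.0  # assume the principal point is the image center
    projected = (project_point(v, focal, cx, cy) for v in vertices)
    pts = [q for q in projected if q is not None]
    mask = [[0] * width for _ in range(height)]
    if not pts:
        return mask
    # Clamp the box to the image; an off-screen box yields an empty range.
    x0 = max(0, int(min(u for u, _ in pts)))
    x1 = min(width - 1, int(max(u for u, _ in pts)))
    y0 = max(0, int(min(v for _, v in pts)))
    y1 = min(height - 1, int(max(v for _, v in pts)))
    for row in range(y0, y1 + 1):
        for col in range(x0, x1 + 1):
            mask[row][col] = 1
    return mask


# A unit quad two units in front of the camera, rendered into an 8x8 image.
quad = [(-0.5, -0.5, 2.0), (0.5, -0.5, 2.0), (0.5, 0.5, 2.0), (-0.5, 0.5, 2.0)]
mask = bbox_mask(quad, focal=4.0, width=8, height=8)
```

Repeating this projection for each camera pose along a trajectory would give one mask per frame, which is roughly the kind of geometry-derived signal a masked video pipeline could condition on.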