🤖 AI Summary
This work addresses the challenge of simultaneously preserving global consistency and local fidelity in natural language-guided fine-grained local editing of 3D point clouds. We propose a diffusion-based, text-conditioned local editing framework that introduces inference-time coordinate mixing—a novel technique that jointly optimizes full-cloud reconstruction and local inpainting across multiple noise levels, thereby circumventing distortion-prone inversion procedures. Our method integrates point cloud completion, a frozen text encoder, and a local shape-constrained inpainting mechanism to explicitly preserve identity within the edited region. Experiments demonstrate significant improvements over state-of-the-art methods in text–shape alignment, local detail fidelity, and global structural coherence. To the best of our knowledge, this is the first approach enabling high-precision, highly controllable, and structurally consistent natural language-guided local editing of 3D point clouds.
📝 Abstract
Natural language offers a highly intuitive interface for enabling localized fine-grained edits of 3D shapes. However, prior works face challenges in preserving global coherence while locally modifying the input 3D shape. In this work, we introduce an inpainting-based framework for editing shapes represented as point clouds. Our approach leverages foundation 3D diffusion models for achieving localized shape edits, adding structural guidance in the form of a partial conditional shape, ensuring that other regions correctly preserve the shape's identity. Furthermore, to encourage identity preservation also within the local edited region, we propose an inference-time coordinate blending algorithm which balances reconstruction of the full shape with inpainting at a progression of noise levels during the inference process. Our coordinate blending algorithm seamlessly blends the original shape with its edited version, enabling a fine-grained editing of 3D shapes, all while circumventing the need for computationally expensive and often inaccurate inversion. Extensive experiments show that our method outperforms alternative techniques across a wide range of metrics that evaluate both fidelity to the original shape and also adherence to the textual description.