🤖 AI Summary
This work proposes Vinedresser3D, an agent-based framework for high-quality, text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Existing methods struggle to simultaneously interpret complex textual instructions, localize editing regions automatically, and preserve unedited content. To address this, the approach uses a multimodal large language model to parse the user instruction, identify the target region, and determine the edit type (addition, modification, or deletion); it then combines an image editing model with an inversion-based rectified-flow latent inpainting pipeline to execute the edit. The method is mask-free, is designed to keep the result 3D-coherent, and guides the edit with separate structural and appearance-level text prompts. In the reported experiments, it outperforms prior baselines on both automatic metrics and human preference studies, producing precise and coherent edits.
📝 Abstract
Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich descriptions of the original asset, identify the edit region and edit type (addition, modification, or deletion), and generate decomposed structural and appearance-level text guidance. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs the edit in the 3D latent space, enforcing prompt alignment while maintaining 3D coherence and preserving unedited regions. Experiments on diverse 3D edits demonstrate that Vinedresser3D outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.
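The inversion-based inpainting step can be illustrated with a toy example. The sketch below is not the Vinedresser3D implementation: it assumes a flat list of floats in place of a 3D latent, a hand-written linear velocity field in place of a learned text-conditioned network, plain Euler integration, and the convention that t=0 is data and t=1 is noise. The names `make_toy_velocity`, `invert`, and `inpaint` are hypothetical, the `edit_mask` stands in for the region the agent would localize on its own, and the interleaved sampling module is omitted.

```python
from typing import Callable, List

Vec = List[float]
Velocity = Callable[[Vec, float], Vec]


def make_toy_velocity(target: Vec) -> Velocity:
    # Hypothetical stand-in for a learned, prompt-conditioned velocity network.
    # Integrating backward in time pulls the state toward `target`.
    return lambda x, t: [xi - ci for xi, ci in zip(x, target)]


def invert(latent: Vec, velocity: Velocity, steps: int) -> List[Vec]:
    # Euler-integrate from t=0 (data) to t=1 (noise), recording every state
    # so the source trajectory can be replayed during editing.
    dt = 1.0 / steps
    traj = [list(latent)]
    for k in range(steps):
        v = velocity(traj[-1], k * dt)
        traj.append([xi + dt * vi for xi, vi in zip(traj[-1], v)])
    return traj


def inpaint(traj: List[Vec], edit_mask: List[bool], velocity: Velocity) -> Vec:
    # Integrate back from t=1 to t=0 under the edit-conditioned velocity.
    steps = len(traj) - 1
    dt = 1.0 / steps
    x = list(traj[-1])  # start from the inverted noise
    for k in range(steps, 0, -1):
        v = velocity(x, k * dt)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
        # Outside the edit region, overwrite with the recorded source state
        # at time (k - 1) * dt so unedited content is reproduced exactly.
        x = [xi if m else si for xi, si, m in zip(x, traj[k - 1], edit_mask)]
    return x


source = [1.0, -2.0, 0.5, 3.0]
edit_mask = [False, True, True, False]  # True marks entries to be edited
traj = invert(source, make_toy_velocity([0.0] * 4), steps=20)
edited = inpaint(traj, edit_mask, make_toy_velocity([5.0] * 4))
```

In this toy setting the unmasked entries of `edited` match `source` exactly, because the final overwrite uses the recorded state at t=0, while the masked entries drift toward the edit target. A real system would have to make the same trade-off between prompt alignment inside the region and fidelity outside it.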