🤖 AI Summary
This work addresses three challenges in 3D editing: identity preservation, multi-view consistency, and spatial semantic alignment. Given a 3D input, it renders orthogonal views and extracts object-level segmentation masks, then learns a distinct token embedding for each semantic component through a two-phase optimization: multi-view textual inversion with attention alignment, followed by full fine-tuning of the multi-view diffusion model. At inference, the disentangled tokens are composed with natural-language editing prompts to generate multi-view consistent images, which are lifted into textured 3D meshes. The method thereby transfers the personalization capabilities of 2D diffusion models to the 3D domain and supports compositional, object-level edits. Across diverse editing scenarios, the authors report state-of-the-art edit faithfulness and identity preservation compared to existing baselines.
📝 Abstract
While 2D diffusion models have achieved remarkable success in identity-preserving personalization, extending this capability to 3D assets remains a significant challenge due to the complexities of multi-view consistency and spatial control. Inspired by these 2D advancements, we present a novel personalization method for text-guided 3D editing that enables compositional, object-level control through natural language. Given a 3D input, we render orthogonal views and extract object-level segmentation masks to isolate semantic components. We then learn distinct token embeddings for each component through a tailored two-phase optimization strategy: multi-view textual inversion with attention alignment, followed by full fine-tuning of the multi-view diffusion model. During inference, these disentangled tokens seamlessly compose with editing prompts to generate multi-view consistent images, which are subsequently lifted into high-fidelity textured 3D meshes. Extensive evaluations across diverse editing scenarios demonstrate that our method successfully transfers the flexibility of 2D personalization to 3D, achieving state-of-the-art edit faithfulness and identity preservation compared to existing baselines.
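The abstract does not give the form of the attention alignment term, so the following is only an illustrative sketch of one plausible choice: for each rendered view, penalize the share of a learned token's cross-attention mass that falls outside that component's segmentation mask, then average over views. The function names, the nested-list grid representation, and the loss form itself are assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Grid = List[List[float]]  # one 2D map per view (hypothetical representation)


def attention_alignment_loss(attn: Grid, mask: Grid, eps: float = 1e-8) -> float:
    """Return 1 minus the fraction of attention mass inside the mask.

    attn: non-negative cross-attention weights for one learned token.
    mask: binary segmentation mask (1.0 inside the object, 0.0 outside).
    This loss form is an assumption, not taken from the paper.
    """
    total = 0.0
    inside = 0.0
    for attn_row, mask_row in zip(attn, mask):
        for a, m in zip(attn_row, mask_row):
            total += a
            inside += a * m
    return 1.0 - inside / (total + eps)


def multi_view_alignment_loss(attn_views: List[Grid], mask_views: List[Grid]) -> float:
    """Average the per-view alignment loss over all rendered views."""
    losses = [
        attention_alignment_loss(attn, mask)
        for attn, mask in zip(attn_views, mask_views)
    ]
    return sum(losses) / len(losses)
```

Under this assumed form, the loss is near 0 when a token attends only to its own object region in every view and near 1 when its attention falls entirely outside the mask, which is the behavior an alignment term would need in order to keep per-component tokens disentangled.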