🤖 AI Summary
Existing 3D editing methods are hindered by the scarcity of high-quality paired supervision data, which leads to suboptimal geometry preservation, multi-view consistency, and local controllability. This work treats semantic parts as the fundamental editing units and proposes a part-based transformation supervision paradigm. The authors construct Pxform, a high-quality dataset of over 100,000 paired before/after samples spanning seven edit types. They also design PartFlow, a feedforward network that injects source-aware latent control into pretrained 3D generative priors, enabling high-fidelity 3D editing without an explicit 3D edit mask at inference. By combining part-level data generation, pretrained 3D priors, mask-aware velocity preservation, and render-space consistency supervision, the method improves boundary sharpness, semantic coherence, and source-structure retention in both geometric and appearance editing, achieving state-of-the-art results.
📝 Abstract
3D editing is a fundamental capability for scalable 3D content creation. While image editing has rapidly evolved toward large-scale feedforward generative paradigms, 3D editing remains dominated by training-free pipelines. A central challenge for feedforward 3D editing is the lack of high-quality paired supervision: edited 3D assets must preserve geometry, remain multi-view consistent and structurally coherent, and keep edits localized and controllable, all at the same time. Existing 3D editing datasets often rely on independently generated assets, image-mediated reconstruction, or narrow edit taxonomies, leading to inaccurate localization, weak preservation, blurred edit boundaries, and limited semantic consistency. In this work, we introduce a new perspective: scalable feedforward 3D editing should be learned from semantic-part transformations. Based on this insight, we propose Pxform, a high-quality 3D editing dataset with over 100K consistent before/after editing pairs across seven edit types. Instead of treating objects as unstructured shapes, our pipeline grounds edits directly in semantic 3D parts. Built upon Pxform, we further propose PartFlow, a feedforward 3D editing network that injects source-aware latent control into pretrained 3D generative priors. PartFlow introduces mask-aware velocity preservation and render-space consistency supervision to jointly improve edit fidelity and source preservation, while requiring no 3D edit mask at inference. Extensive experiments demonstrate that high-quality semantic-part supervision substantially improves scalable 3D editing, enabling PartFlow to achieve state-of-the-art performance on both geometric and appearance editing benchmarks.
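The abstract does not spell out how mask-aware velocity preservation is formulated. As an intuition aid only, the following is a minimal sketch of one generic way a training-time part mask could weight a rectified-flow (flow-matching) loss. Predicted velocities on tokens inside the edited part are regressed toward the edited target. Tokens outside it are regressed toward the velocity that reconstructs the source. The function name `masked_flow_matching_loss`, the weights `w_edit` and `w_keep`, the noise-to-data linear path, and the per-token mask layout are all hypothetical assumptions, not PartFlow's actual objective. Because the mask enters only the training loss, a model trained this way would need no edit mask at inference, which matches the property the abstract claims.

```python
import numpy as np


def masked_flow_matching_loss(v_pred, x_noise, x_target, x_source, edit_mask,
                              w_edit=1.0, w_keep=2.0):
    """Hypothetical mask-weighted flow-matching loss (illustration only).

    v_pred, x_noise, x_target, x_source: (num_tokens, channels) latent arrays.
    edit_mask: (num_tokens,) array, 1 for tokens in the edited part, 0 elsewhere.
    Assumes a rectified-flow path x_t = (1 - t) * x_noise + t * x_data,
    whose target velocity is x_data - x_noise.
    """
    m = edit_mask.astype(float)
    # Velocity that carries the noise sample to the edited asset.
    v_edit = x_target - x_noise
    # Velocity that carries the noise sample back to the source asset;
    # outside the edited part the model is pushed to reproduce the source.
    v_keep = x_source - x_noise
    # Per-token squared error, summed over latent channels.
    edit_err = ((v_pred - v_edit) ** 2).sum(axis=1)
    keep_err = ((v_pred - v_keep) ** 2).sum(axis=1)
    # Average each term over its own region; guard against empty regions.
    n_edit = max(m.sum(), 1.0)
    n_keep = max((1.0 - m).sum(), 1.0)
    edit_term = (m * edit_err).sum() / n_edit
    keep_term = ((1.0 - m) * keep_err).sum() / n_keep
    return float(w_edit * edit_term + w_keep * keep_term)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_noise = rng.normal(size=(6, 4))
    x_source = rng.normal(size=(6, 4))
    mask = np.array([1, 1, 0, 0, 0, 0])
    # Toy "edit": only the first two tokens (the edited part) change.
    x_target = x_source.copy()
    x_target[:2] += 1.0
    # A perfect predictor follows the edit inside the mask and the source outside.
    v_perfect = np.where(mask[:, None] == 1, x_target - x_noise, x_source - x_noise)
    print(masked_flow_matching_loss(v_perfect, x_noise, x_target, x_source, mask))
```

Separating the two terms makes the trade-off between edit fidelity (`w_edit`) and source preservation (`w_keep`) explicit. Normalizing each term by its own token count keeps small edited parts from being drowned out by the much larger preserved region.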