Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

career value

199K/year

🤖 AI Summary

This work addresses key limitations in existing 3D editing methods—namely, insufficient semantic consistency, poor preservation of local invariance, limited control over editable regions, and a lack of large-scale training and evaluation data. To overcome these challenges, the authors propose the BVE framework, which introduces the first large-scale, self-constructed dataset tailored for 3D editing. Building upon image-to-3D generation models, BVE incorporates a lightweight trainable module to enable efficient text-driven editing. A novel unsupervised 3D masking mechanism is introduced to preserve geometric and appearance consistency in unedited regions without requiring additional supervision. Experimental results demonstrate that the proposed method generates high-quality, text-aligned 3D assets and significantly outperforms current approaches across multiple evaluation metrics.

Technology Category

Application Category

📝 Abstract

3D editing refers to the ability to apply local or global modifications to 3D assets. Effective 3D editing requires maintaining semantic consistency by performing localized changes according to prompts, while also preserving local invariance so that unchanged regions remain consistent with the original. However, existing approaches have significant limitations: multi-view editing methods incur losses when projecting back to 3D, while voxel-based editing is constrained in both the regions that can be modified and the scale of modifications. Moreover, the lack of sufficiently large editing datasets for training and evaluation remains a challenge. To address these challenges, we propose a Beyond Voxel 3D Editing (BVE) framework with a self-constructed large-scale dataset specifically tailored for 3D editing. Building upon this dataset, our model enhances a foundational image-to-3D generative architecture with lightweight, trainable modules, enabling efficient injection of textual semantics without the need for expensive full-model retraining. Furthermore, we introduce an annotation-free 3D masking strategy to preserve local invariance, maintaining the integrity of unchanged regions during editing. Extensive experiments demonstrate that BVE achieves superior performance in generating high-quality, text-aligned 3D assets, while faithfully retaining the visual characteristics of the original input.

Problem

Research questions and friction points this paper is trying to address.

3D editing

semantic consistency

local invariance

voxel-based editing

3D dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D editing

self-constructed dataset

lightweight trainable modules