Native 3D Editing with Full Attention

📅 2025-11-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing instruction-guided 3D editing methods rely either on prohibitively slow optimization-based pipelines or on feed-forward 2D lifting approaches that suffer from poor geometric consistency and degraded visual fidelity. This paper introduces the first end-to-end, native 3D feed-forward editing framework that operates directly on 3D representations (e.g., 3D Gaussian Splatting) to perform additive, subtractive, and transformative edits. The key contributions are: (1) an efficient 3D token concatenation mechanism that replaces cross-attention for superior conditional control; and (2) a large-scale multimodal 3D editing dataset of instruction–multi-view–mask triplets, explicitly curated to ensure edit accuracy and preservation of unedited regions. Experiments show that the method consistently outperforms state-of-the-art 2D lifting approaches in generation quality, 3D geometric consistency, and instruction adherence, while achieving significant gains in editing efficiency and visual fidelity, establishing a new benchmark for instruction-driven 3D editing.
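
The "instruction–multi-view–mask triplets" mentioned above suggest a sample layout along the following lines. This is a minimal sketch; the class name (`EditSample`), field names, and array shapes are illustrative assumptions, not the paper's published schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EditSample:
    """One hypothetical training sample for instruction-guided 3D editing."""
    instruction: str          # e.g. "add a scarf to the snowman"
    source_views: np.ndarray  # (V, H, W, 3) multi-view renders of the source object
    edited_views: np.ndarray  # (V, H, W, 3) target renders after the edit is applied
    edit_masks: np.ndarray    # (V, H, W) binary per-view masks marking the edited region
```

The per-view masks are what would let a training objective penalize changes outside the edited region, matching the stated goal of preserving unedited content.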

📝 Abstract
Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.
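
For contrast with the paper's proposal, here is a minimal PyTorch sketch of the conventional cross-attention conditioning strategy the abstract mentions as a baseline. The block layout, names (`CrossAttnConditioningBlock`), and dimensions are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CrossAttnConditioningBlock(nn.Module):
    """Baseline: latent 3D tokens read the condition through a dedicated
    cross-attention module, separate from their own self-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latent: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # latent: (B, N, D) tokens of the 3D representation being edited
        # cond:   (B, M, D) instruction / source-object condition tokens
        h = self.norm1(latent)
        latent = latent + self.self_attn(h, h, h)[0]        # latent tokens mix among themselves
        h = self.norm2(latent)
        latent = latent + self.cross_attn(h, cond, cond)[0]  # latent tokens attend to the condition
        return latent
```

Note that the condition stream here requires a second attention module with its own projection weights, which is where the parameter-efficiency comparison below comes in.
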
Problem

Research questions and friction points this paper is trying to address.

Addressing slow optimization and inconsistent geometry in 3D editing
Developing efficient feed-forward native 3D manipulation techniques
Ensuring instruction fidelity while preserving 3D consistency and quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Native 3D editing framework with single feed-forward pass
Large multi-modal dataset for instruction-guided 3D editing
3D token concatenation approach for parameter-efficient conditioning
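
The token-concatenation strategy from the last bullet can be sketched under the same assumptions as the cross-attention baseline above: condition tokens and latent 3D tokens are joined into one sequence and processed by a single full self-attention. The names and layer layout are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ConcatConditioningBlock(nn.Module):
    """Token-concatenation conditioning: one joint sequence, one attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latent: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # latent: (B, N, D) 3D tokens being edited; cond: (B, M, D) condition tokens
        n = latent.shape[1]
        x = torch.cat([cond, latent], dim=1)  # (B, M + N, D): one joint sequence
        h = self.norm(x)
        x = x + self.attn(h, h, h)[0]         # full attention over condition + latent
        return x[:, -n:]                      # keep only the (edited) latent tokens
```

Relative to the cross-attention baseline, this block drops the second attention module and its key/value projections entirely, which is one plausible reading of the abstract's claim that token concatenation is the more parameter-efficient conditioning strategy.
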
Authors
Weiwei Cai (Fudan University)
Shuangkang Fang (StepFun, Inc.)
Weicai Ye (Kling Team, Kuaishou Technology; Multimodal Generative Foundation Models, World Model, 3D Vision, Embodied AI, AGI)
Xin Dong (Tsinghua University)
Yunhan Yang (VAST)
Xuanyang Zhang (StepFun AI Researcher; Neural Architecture Design, AIGC, 3D Generation, Multi-modal)
Wei Cheng (StepFun, Inc.)
Yanpei Cao (VAST)
Gang Yu (StepFun, Inc.)
Tao Chen (Fudan University)