🤖 AI Summary
This work addresses the challenge of learning structured 3D features from single-view 2D image supervision. We propose the first feature-distillation-based framework for a decoupled, multi-field implicit representation. Our method explicitly decomposes the 3D feature field into two orthogonal components, a view-invariant one (geometry and semantics) and a view-dependent one (e.g., reflectance), enforced by structural decoupling constraints and 2D–3D cross-dimensional consistency regularization. The model is optimized end to end using only pretrained 2D vision features, without any 3D annotations. Unlike conventional monolithic volumetric representations, our approach supports fine-grained, interactive editing of semantic and physical attributes (e.g., specular reflection removal). It achieves state-of-the-art performance on 3D segmentation and, for the first time, enables single-image-driven interactive 3D segmentation, attribute editing, and physically controllable effect removal.
📝 Abstract
Recent work has shown that 2D features obtained from large pre-trained 2D models can be leveraged or distilled into 3D features, enabling impressive 3D editing and understanding capabilities using only 2D supervision. Existing models, however, assume that 3D features are captured by a single feature field, and they often make the simplifying assumption that features are view-independent. In this work, we instead propose to capture 3D features using multiple disentangled feature fields, each representing a different structural component of the 3D features, namely the view-dependent and view-independent components, all of which can be learned from 2D feature supervision only. Each component can then be controlled in isolation, enabling semantic and structural understanding and editing capabilities. For instance, using a user click, one can segment the 3D features corresponding to a given object and then segment, edit, or remove its view-dependent (reflective) properties. We evaluate our approach on the task of 3D segmentation and demonstrate a set of novel understanding and editing tasks.
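
To make the idea of disentangled feature fields concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the view-independent and view-dependent components are combined additively, that features are aggregated along a ray with standard volume-rendering weights, and that a click selects points by cosine similarity in the view-independent field. The abstract does not specify any of these choices. All names (`view_independent`, `view_dependent`, `render_feature`, `click_segment`, `FEAT_DIM`) are hypothetical, and the two "fields" are fixed random linear maps standing in for learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = 8  # hypothetical feature dimension

# Hypothetical stand-ins for two learned fields (fixed random linear maps).
W_INV = rng.normal(size=(3, FEAT_DIM))  # position -> view-independent feature
W_DEP = rng.normal(size=(6, FEAT_DIM))  # (position, direction) -> view-dependent feature


def view_independent(points):
    """Feature that depends on the 3D position only."""
    return np.tanh(points @ W_INV)


def view_dependent(points, direction):
    """Feature that depends on both position and viewing direction."""
    dirs = np.broadcast_to(direction, points.shape)
    return np.tanh(np.concatenate([points, dirs], axis=-1) @ W_DEP)


def render_weights(density, deltas):
    """Volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    alpha = 1.0 - np.exp(-density * deltas)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return transmittance * alpha


def render_feature(points, direction, density, deltas, use_view_dependent=True):
    """Aggregate per-sample features along one ray into a single 2D feature."""
    feats = view_independent(points)
    if use_view_dependent:
        # Assumption: the two components are summed. Skipping this term
        # mimics "removing" the view-dependent (reflective) part.
        feats = feats + view_dependent(points, direction)
    return render_weights(density, deltas) @ feats


def distillation_loss(rendered, target_2d):
    """Mean squared error against a feature from a pre-trained 2D model."""
    return float(np.mean((rendered - target_2d) ** 2))


def click_segment(point_feats, clicked_feat, threshold=0.8):
    """Select points whose feature is cosine-similar to the clicked feature."""
    eps = 1e-12
    a = point_feats / np.maximum(np.linalg.norm(point_feats, axis=-1, keepdims=True), eps)
    b = clicked_feat / max(np.linalg.norm(clicked_feat), eps)
    return (a @ b) >= threshold


# Usage: 16 samples along a ray, rendered with and without the view-dependent part.
pts = np.linspace(0.5, 2.0, 16)[:, None] * np.array([0.0, 0.0, 1.0])
density, deltas = np.full(16, 0.7), np.full(16, 0.1)
view = np.array([0.0, 0.0, 1.0])
full = render_feature(pts, view, density, deltas)
diffuse_only = render_feature(pts, view, density, deltas, use_view_dependent=False)
mask = click_segment(view_independent(pts), view_independent(pts)[3])
```

In this sketch, keeping the components in separate functions is what allows each one to be controlled in isolation: `diffuse_only` is unchanged when the viewing direction changes, while `full` is not.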