🤖 AI Summary
Existing 2D vision foundation models (e.g., DINOv2, CLIP, SAM) struggle to generalize directly to 3D scene understanding due to the dimensionality gap and lack of explicit 3D geometric priors.
Method: This paper proposes a learning-free 2D feature uplift paradigm that efficiently maps semantic features onto 3D Gaussian Splatting representations. Instead of relying on conventional reconstruction losses, it introduces a graph diffusion mechanism that jointly encodes 3D geometric structure and DINOv2 feature similarity to enable cross-dimensional feature alignment and propagation, followed by lightweight feature aggregation for 3D semantic decoding.
Contribution/Results: The method achieves state-of-the-art performance on 3D object segmentation and open-vocabulary localization while using only off-the-shelf DINOv2 features, with no task-specific annotations or fine-tuning. It generalizes well, significantly accelerates inference, and requires no learnable parameters in the feature-lifting stage.
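The core idea, feature propagation over a graph that mixes 3D proximity with DINOv2 feature similarity, can be sketched as label-propagation-style diffusion. The sketch below is purely illustrative and does not reproduce the paper's exact algorithm or operate on actual Gaussian Splatting data: the k-NN construction, the Gaussian distance kernel, the similarity weighting, and all parameter names (`sigma`, `beta`, `alpha`, `steps`) are assumptions chosen for clarity.

```python
import numpy as np

def knn_graph(points, k=8):
    """k-nearest-neighbor edges over Gaussian centers (O(N^2), illustrative only)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-edges
    idx = np.argsort(d2, axis=1)[:, :k]   # indices of k nearest neighbors per node
    return idx, d2

def diffuse_features(points, feats, seeds, k=8, sigma=0.5, beta=10.0,
                     alpha=0.8, steps=20):
    """Propagate coarse per-Gaussian signals (e.g. segmentation masks) over a
    graph whose edge weights combine geometric closeness and DINOv2 cosine
    similarity. Hypothetical sketch, not the paper's implementation."""
    n = points.shape[0]
    idx, d2 = knn_graph(points, k)
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    # Normalize DINOv2 features, then weight each edge by geometric affinity
    # (Gaussian kernel on distance) times feature-similarity affinity.
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = (f[rows] * f[cols]).sum(-1)                       # cosine similarity
    w = np.exp(-d2[rows, cols] / (2 * sigma**2)) * np.exp(beta * (sim - 1.0))
    W = np.zeros((n, n))
    W[rows, cols] = w
    W = 0.5 * (W + W.T)                                     # symmetrize
    P = W / (W.sum(1, keepdims=True) + 1e-8)                # row-stochastic transition matrix
    x = seeds.astype(float).copy()
    for _ in range(steps):
        # Diffuse along the graph while anchoring to the original coarse seeds.
        x = alpha * (P @ x) + (1 - alpha) * seeds
    return x
```

Anchoring the iteration to the initial seeds (the `(1 - alpha) * seeds` term) keeps the refined masks from drifting away from the coarse input, a standard choice in semi-supervised label propagation.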
📝 Abstract
We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like SAM. When applied to CLIP features, our method demonstrates strong performance in open-vocabulary object localization tasks, highlighting the versatility of our approach.