SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of annotated 3D training data and the challenge of effectively leveraging pre-trained 2D vision knowledge for 3D instance segmentation, this paper proposes SegDINO3D—a Transformer-based cross-modal 3D instance segmentation framework. Its core innovation lies in employing learnable 3D anchor boxes as queries and dynamically fusing image-level and object-level features from a pre-trained 2D detector via a lightweight cross-modal attention mechanism—bypassing storage of high-dimensional image feature maps and significantly enhancing 3D representation learning. The architecture comprises a 3D context fusion encoder and a dual-granularity 2D feature-guided decoder. SegDINO3D achieves state-of-the-art performance on both ScanNetV2 and ScanNet200 benchmarks, improving mAP by 8.7 and 6.8 percentage points on the ScanNet200 validation and hidden test sets, respectively.

📝 Abstract
In this paper, we present SegDINO3D, a novel Transformer encoder-decoder framework for 3D instance segmentation. As 3D training data is generally not as plentiful as 2D training images, SegDINO3D is designed to fully leverage 2D representations from a pre-trained 2D detection model, including both image-level and object-level features, to improve 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then leverages a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross-attention from 3D queries to 2D object queries obtained from 2D images using the 2D detection model. These 2D object queries serve as a compact object-level representation of the 2D images, effectively avoiding the challenge of keeping thousands of image feature maps in memory while faithfully preserving the knowledge of the pre-trained 2D model. The introduction of 3D box queries also enables the model to modulate cross-attention using the predicted boxes for more precise querying. SegDINO3D achieves state-of-the-art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D significantly outperforms prior methods by +8.7 and +6.8 mAP on the validation and hidden test sets, respectively, demonstrating its superiority.
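The box-modulated cross-attention described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: the function name, the dot-product attention form, and the log-overlap bias used to inject the geometric prior are all assumptions, standing in for however SegDINO3D actually conditions attention on the predicted 3D anchor boxes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def box_modulated_cross_attention(q3d, k2d, v2d, box_overlap):
    """Cross-attention from 3D object queries to 2D object queries
    (illustrative sketch, not the paper's implementation).

    q3d:  (Nq, d)  content features of the 3D anchor-box queries
    k2d:  (No, d)  keys derived from the 2D detector's object queries
    v2d:  (No, d)  values derived from the 2D detector's object queries
    box_overlap: (Nq, No) geometric prior, e.g. overlap between each
        projected 3D anchor box and each 2D detection box (assumed form)
    """
    d = q3d.shape[-1]
    logits = q3d @ k2d.T / np.sqrt(d)
    # Modulate attention with the box prior: a 3D query attends more to
    # 2D objects whose boxes overlap its projected anchor box.
    logits = logits + np.log(box_overlap + 1e-6)
    attn = softmax(logits, axis=-1)
    return attn @ v2d
```

Because the 2D object queries are a compact set (tens per view rather than dense feature maps), this attention is cheap and avoids storing full-resolution image features, which is the memory saving the abstract highlights.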
Problem

Research questions and friction points this paper is trying to address.

Leveraging 2D image and object features for 3D instance segmentation
Addressing insufficient 3D training data with pre-trained 2D models
Improving cross-modal fusion between 3D point clouds and 2D images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages 2D image and object features
Uses 3D anchor box cross-attention mechanism
Combines point cloud with 2D detection model
Jinyuan Qu
Ph.D. Student, Tsinghua University
Computer Vision
Hongyang Li
South China University of Technology, International Digital Economy Academy (IDEA)
Xingyu Chen
Peking University, International Digital Economy Academy (IDEA)
Shilong Liu
RS@ByteDance, PhD@THU
Computer Vision · Object Detection · Visual Grounding · Multi-Modality · Multimodal Agent
Yukai Shi
Tsinghua University, International Digital Economy Academy (IDEA)
Tianhe Ren
PhD student of Electrical and Electronic Engineering, The University of Hong Kong
Computer Vision · Machine Learning · Multi-Modality
Ruitao Jing
Tsinghua University, International Digital Economy Academy (IDEA)
Lei Zhang
International Digital Economy Academy (IDEA)