SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of annotated 3D training data and the challenge of effectively leveraging pre-trained 2D vision knowledge for 3D instance segmentation, this paper proposes SegDINO3D—a Transformer-based cross-modal 3D instance segmentation framework. Its core innovation lies in employing learnable 3D anchor boxes as queries and dynamically fusing image-level and object-level features from a pre-trained 2D detector via a lightweight cross-modal attention mechanism—bypassing storage of high-dimensional image feature maps and significantly enhancing 3D representation learning. The architecture comprises a 3D context fusion encoder and a dual-granularity 2D feature-guided decoder. SegDINO3D achieves state-of-the-art performance on both ScanNetV2 and ScanNet200 benchmarks, improving mAP by 8.7 and 6.8 percentage points on the ScanNet200 validation and hidden test sets, respectively.

📝 Abstract
In this paper, we present SegDINO3D, a novel Transformer encoder-decoder framework for 3D instance segmentation. As 3D training data is generally not as plentiful as 2D training images, SegDINO3D is designed to fully leverage 2D representations from a pre-trained 2D detection model, including both image-level and object-level features, to improve 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then leverages a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross-attention from 3D queries to 2D object queries obtained from 2D images using the 2D detection model. These 2D object queries serve as a compact object-level representation of the 2D images, effectively avoiding the challenge of keeping thousands of image feature maps in memory while faithfully preserving the knowledge of the pre-trained 2D model. The introduction of 3D box queries also enables the model to modulate cross-attention using the predicted boxes for more precise querying. SegDINO3D achieves state-of-the-art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D significantly outperforms prior methods by +8.7 and +6.8 mAP on the validation and hidden test sets, respectively, demonstrating its superiority.
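The box-modulated cross-attention described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: the function name, the dot-product attention form, and the log-overlap bias used to inject the geometric prior are all assumptions, standing in for however SegDINO3D actually conditions attention on the predicted 3D anchor boxes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def box_modulated_cross_attention(q3d, k2d, v2d, box_overlap):
    """Cross-attention from 3D object queries to 2D object queries
    (illustrative sketch, not the paper's implementation).

    q3d:  (Nq, d)  content features of the 3D anchor-box queries
    k2d:  (No, d)  keys derived from the 2D detector's object queries
    v2d:  (No, d)  values derived from the 2D detector's object queries
    box_overlap: (Nq, No) geometric prior, e.g. overlap between each
        projected 3D anchor box and each 2D detection box (assumed form)
    """
    d = q3d.shape[-1]
    logits = q3d @ k2d.T / np.sqrt(d)
    # Modulate attention with the box prior: a 3D query attends more to
    # 2D objects whose boxes overlap its projected anchor box.
    logits = logits + np.log(box_overlap + 1e-6)
    attn = softmax(logits, axis=-1)
    return attn @ v2d
```

Because the 2D object queries are a compact set (tens per view rather than dense feature maps), this attention is cheap and avoids storing full-resolution image features, which is the memory saving the abstract highlights.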
Problem

Research questions and friction points this paper is trying to address.

Leveraging 2D image and object features for 3D instance segmentation
Addressing insufficient 3D training data with pre-trained 2D models
Improving cross-modal fusion between 3D point clouds and 2D images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages 2D image and object features
Uses 3D anchor box cross-attention mechanism
Combines point cloud with 2D detection model
Jinyuan Qu
Ph.D. Student, Tsinghua University
Computer Vision
Hongyang Li
South China University of Technology, International Digital Economy Academy (IDEA)
Xingyu Chen
Peking University, International Digital Economy Academy (IDEA)
Shilong Liu
RS@ByteDance, PhD@THU
Computer Vision · Object Detection · Visual Grounding · Multi-Modality · Multimodal Agent
Yukai Shi
Tsinghua University, International Digital Economy Academy (IDEA)
Tianhe Ren
PhD student of Electrical and Electronic Engineering, The University of Hong Kong
Computer Vision · Machine Learning · Multi-Modality
Ruitao Jing
Tsinghua University, International Digital Economy Academy (IDEA)
Lei Zhang
International Digital Economy Academy (IDEA)