SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts

📅 2025-03-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the challenging problem of 3D object detection under extremely sparse supervision, e.g., only 1–5 3D bounding boxes per class. To tackle this, the authors propose SP3D, a cross-modal semantic guidance framework grounded in Large Multimodal Models (LMMs). The method introduces three key components: (1) a Confident Points Semantic Transfer (CPST) module that generates accurate cross-modal semantic prompts via boundary-constrained center cluster selection, improving pseudo-label localization; (2) a Dynamic Cluster Pseudo-label Generation (DCPG) module that derives pseudo-supervision signals from the geometric shape of multi-scale neighbor points around those prompts; and (3) a Distribution Shape score (DS score) that filters high-quality supervision signals for the detector's initial training. Evaluated on KITTI and the Waymo Open Dataset, the approach significantly outperforms existing sparsely supervised methods and exceeds state-of-the-art performance even in the zero-shot setting.

πŸ“ Abstract
Recently, sparsely-supervised 3D object detection has gained great attention, achieving performance close to fully-supervised 3D detectors while requiring only a few annotated instances. Nevertheless, these methods still struggle when accurate labels are extremely scarce. In this paper, we propose a boosting strategy, termed SP3D, that explicitly utilizes cross-modal semantic prompts generated by Large Multimodal Models (LMMs) to endow the 3D detector with robust feature discrimination capability under sparse annotation settings. Specifically, we first develop a Confident Points Semantic Transfer (CPST) module that generates accurate cross-modal semantic prompts through boundary-constrained center cluster selection. Treating these accurate semantic prompts as seed points, we introduce a Dynamic Cluster Pseudo-label Generation (DCPG) module to yield pseudo-supervision signals from the geometric shape of multi-scale neighbor points. Additionally, we design a Distribution Shape score (DS score) that selects high-quality supervision signals for the initial training of the 3D detector. Experiments on the KITTI dataset and the Waymo Open Dataset (WOD) validate that SP3D can enhance the performance of sparsely supervised detectors by a large margin under meager labeling conditions. Moreover, we verified SP3D in the zero-shot setting, where its performance exceeded that of state-of-the-art methods. The code is available at https://github.com/xmuqimingxia/SP3D.
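The abstract's pipeline (seed prompts → multi-scale neighbor clustering → quality scoring) can be illustrated with a toy sketch. Everything here is a loose guess at the idea, not the paper's actual DCPG or DS-score procedure: the radii, the minimum cluster size, and the density-based stand-in score are all illustrative assumptions.

```python
import numpy as np

def dynamic_cluster_pseudo_labels(points, seeds, radii=(0.5, 1.0, 2.0)):
    """Toy DCPG-style sketch: grow clusters around semantic-prompt seeds
    at several radii, fit an axis-aligned box to each candidate cluster,
    and keep the box whose cluster scores best under a crude density
    score (a stand-in for the paper's DS score).

    points : (N, 3) array of point-cloud coordinates
    seeds  : iterable of (3,) seed points (cross-modal semantic prompts)
    """
    boxes = []
    for seed in seeds:
        dists = np.linalg.norm(points - seed, axis=1)
        best, best_score = None, -np.inf
        for r in radii:
            cluster = points[dists < r]
            if len(cluster) < 5:  # illustrative minimum-support threshold
                continue
            lo, hi = cluster.min(axis=0), cluster.max(axis=0)
            volume = np.prod(np.maximum(hi - lo, 1e-6))
            score = len(cluster) / volume  # density as a crude quality proxy
            if score > best_score:
                best_score, best = score, (lo, hi)
        if best is not None:
            boxes.append(best)
    return boxes
```

A seed placed inside a dense object cluster yields a tight box around that cluster, while points belonging to distant objects fall outside every radius and are ignored; in the paper, such boxes would serve as pseudo-labels for the detector's initial training.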
Problem

Research questions and friction points this paper is trying to address.

Enhance 3D object detection with sparse annotations.
Generate accurate cross-modal semantic prompts using LMMs.
Improve detector performance in zero-shot and low-label settings.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes cross-modal semantic prompts from LMMs.
Develops CPST for accurate semantic prompts.
Introduces DCPG for pseudo-supervision signals.
Shijia Zhao
Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China
Qiming Xia
Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China
Xusheng Guo
Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China
Pufan Zou
Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China
Maoji Zheng
Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China
Hai Wu
The University of Hong Kong
Chenglu Wen
Professor, Xiamen University
3D vision, point clouds, mobile mapping, robotics
Cheng Wang
Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China