3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection

📅 2025-06-11
🤖 AI Summary
Single- and multi-view 3D object detection from images suffers from depth ambiguity because explicit geometric cues are absent. To address this, 3DGeoDet builds 3D geometric representations from predicted depth along two paths: explicit voxel occupancy modeling and an implicit truncated signed distance function (TSDF) representation. The method constructs a voxelized 3D feature volume, refines it with a voxel occupancy attention mechanism, and integrates it with the TSDF, all within an end-to-end trainable architecture that needs no supervision from 3D signals. On SUN RGB-D and ScanNetV2 it improves mAP@0.5 over prior image-based methods by 9.3 and 3.3, respectively, and on KITTI it improves AP3D@0.7 by 0.19. These results suggest that combining explicit and implicit 3D representations strengthens geometric reasoning for both single- and multi-view 3D detection.

📝 Abstract
This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection approach that effectively handles single- and multi-view RGB images in indoor and outdoor environments, showcasing its general-purpose applicability. The key challenge for image-based 3D object detection tasks is the lack of 3D geometric cues, which leads to ambiguity in establishing correspondences between images and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D geometric representations in both explicit and implicit manners based on predicted depth information. Specifically, we utilize the predicted depth to learn voxel occupancy and optimize the voxelized 3D feature volume explicitly through the proposed voxel occupancy attention. To further enhance 3D awareness, the feature volume is integrated with an implicit 3D representation, the truncated signed distance function (TSDF). Without requiring supervision from 3D signals, we significantly improve the model's comprehension of 3D geometry by leveraging intermediate 3D representations and achieve end-to-end training. Our approach surpasses the performance of state-of-the-art image-based methods on both single- and multi-view benchmark datasets across diverse environments, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D dataset, a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset, and a 0.19 AP3D@0.7 improvement on the KITTI dataset. The project page is available at: https://cindy0725.github.io/3DGeoDet/.
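The abstract does not spell out how the TSDF is derived from predicted depth. As a rough illustration only, the sketch below computes a standard projective TSDF for a set of voxel centers given a depth map and pinhole intrinsics. The function name `projective_tsdf`, the truncation distance `trunc`, the intrinsics `K`, and the constant depth map are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def projective_tsdf(depth, K, voxel_centers, trunc=0.2):
    """Projective TSDF for voxel centers of shape (N, 3) in the camera frame.

    Voxels behind the camera or projecting outside the image keep the
    value +1 (treated as free/unknown space in this sketch).
    """
    h, w = depth.shape
    z = voxel_centers[:, 2]
    tsdf = np.ones(len(voxel_centers))

    in_front = z > 1e-6
    uvw = voxel_centers[in_front] @ K.T          # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = np.flatnonzero(in_front)[inside]
    # Signed distance along the viewing direction: positive in front of
    # the observed surface, negative behind it.
    sdf = depth[v[inside], u[inside]] - z[idx]
    tsdf[idx] = np.clip(sdf / trunc, -1.0, 1.0)
    return tsdf

# Hypothetical intrinsics and a flat wall 2 m from the camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)
centers = np.array([[0.0, 0.0, 1.0],    # well in front of the wall
                    [0.0, 0.0, 2.0],    # on the wall
                    [0.0, 0.0, 2.1],    # slightly behind the wall
                    [0.0, 0.0, 3.0],    # far behind the wall
                    [0.0, 0.0, -1.0]])  # behind the camera
tsdf = projective_tsdf(depth, K, centers)
```

Values near zero mark voxels close to the predicted surface, which is one simple way a depth estimate can be turned into a geometric signal without 3D ground truth.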

Problem

Research questions and friction points this paper is trying to address.

Lack of 3D geometric cues in image-based detection
Ambiguity in image-to-3D representation correspondences
Need for general-purpose indoor/outdoor 3D detection
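The ambiguity named above can be made concrete with a pinhole camera: every 3D point along a viewing ray projects to the same pixel, so a single image cannot tell them apart without a depth estimate. The following is a minimal sketch; the intrinsics `K` and the helper names `backproject` and `project` are hypothetical and chosen only for illustration.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift pixel (u, v) at a given depth to a 3D point in the camera frame."""
    pixel_h = np.array([u, v, 1.0])
    return depth * (np.linalg.inv(K) @ pixel_h)

def project(point, K):
    """Project a 3D camera-frame point to pixel coordinates."""
    uvw = K @ point
    return uvw[:2] / uvw[2]

# Hypothetical intrinsics for a 640x480 image.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Two different 3D points on the same viewing ray.
near = backproject(400.0, 300.0, 1.0, K)
far = backproject(400.0, 300.0, 5.0, K)

# Both land on the same pixel, so the image alone cannot separate them.
same_pixel = np.allclose(project(near, K), project(far, K))
```

Predicted depth selects one point on each ray, which is why depth-derived geometry can reduce the image-to-3D correspondence ambiguity.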

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses voxel occupancy attention for 3D feature volume
Integrates TSDF for implicit 3D representation
Leverages predicted depth without 3D supervision
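The page does not describe the exact form of the voxel occupancy attention. One simple reading, offered here purely as an assumption, is to turn per-voxel occupancy logits into probabilities and use them to re-weight the feature volume, so features in likely-empty voxels are suppressed. The function name `occupancy_attention`, the tensor layout `(C, X, Y, Z)`, and the toy logits are hypothetical and may differ from the paper's design.

```python
import numpy as np

def occupancy_attention(features, occupancy_logits):
    """Re-weight a feature volume (C, X, Y, Z) by per-voxel occupancy.

    Assumed formulation: sigmoid gating, not necessarily the paper's.
    """
    occupancy = 1.0 / (1.0 + np.exp(-occupancy_logits))  # (X, Y, Z), in (0, 1)
    # Broadcast the occupancy map over the channel dimension.
    return features * occupancy[None, ...], occupancy

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4, 2))

# Toy occupancy: every voxel is almost surely empty except one.
logits = np.full((4, 4, 2), -10.0)
logits[1, 2, 0] = 10.0

weighted, occ = occupancy_attention(features, logits)
```

In this toy case the features at voxel `(1, 2, 0)` pass through almost unchanged while all others are scaled close to zero.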

Yi Zhang
Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong, China
Yi Wang
Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong, China
Yawen Cui
University of Oulu
Few-Shot Learning, Continual Learning, Multimodal Learning
Lap-Pui Chau
The Hong Kong Polytechnic University
Visual Signal Processing