NeRF-DetS: Enhanced Adaptive Spatial-wise Sampling and View-wise Fusion Strategies for NeRF-based Indoor Multi-view 3D Object Detection

📅 2024-04-22

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Indoor multi-view 3D object detection is hampered by large variations in object position and scale and by severe occlusion, all of which degrade detection accuracy. To address this, the paper proposes NeRF-DetS, an end-to-end detection framework that uses Neural Radiance Fields (NeRF) for implicit scene representation. The method has three key components: (1) a Progressive Adaptive Sampling Strategy (PASS), in which each layer of the detector samples features at positions shifted by offsets predicted by the previous layer; (2) a Depth-Guided Simplified Multi-Head Attention (DS-MHA) module that fuses cross-view features efficiently while explicitly modeling occlusion; and (3) an integrated design combining NeRF-based scene reconstruction, learnable offset sampling, and a dense 3D detector. On ScanNetV2, the approach improves on NeRF-Det by +5.02% and +5.92% in mAP@IoU25 and mAP@IoU50, respectively. It also shows consistent improvements on ARKitScenes.
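The layer-by-layer sampling idea behind PASS can be illustrated with a small sketch. This is not the paper's implementation: the function `progressive_sampling`, the callables `sample_fn` and `offset_fns`, and the toy feature field below are all hypothetical names chosen for illustration. The sketch only shows the general pattern of sampling a feature at the current position, predicting an offset from it, and using the shifted position in the next layer.

```python
def progressive_sampling(init_points, sample_fn, offset_fns):
    """Refine sampling positions layer by layer (illustrative sketch only).

    init_points: list of (x, y, z) starting positions, e.g. fixed grid centres
    sample_fn:   callable mapping a position to a feature vector
    offset_fns:  one callable per layer, mapping a feature vector to (dx, dy, dz)
    Returns the list of point sets, one entry per stage (initial + each layer).
    """
    points = [tuple(p) for p in init_points]
    history = [points]
    for predict_offset in offset_fns:
        # Sample features at the positions produced by the previous layer.
        feats = [sample_fn(p) for p in points]
        # Predict one offset per point and shift the sampling positions.
        offsets = [predict_offset(f) for f in feats]
        points = [tuple(c + d for c, d in zip(p, o)) for p, o in zip(points, offsets)]
        history.append(points)
    return history


# Toy setup (entirely hypothetical): the "feature" at a position is the vector
# pointing to an object centre, and each layer moves halfway along it.
TARGET = (1.0, 1.0, 1.0)

def toy_sample(p):
    return tuple(t - c for t, c in zip(TARGET, p))

def toy_offset(feat):
    return tuple(0.5 * f for f in feat)

history = progressive_sampling([(0.0, 0.0, 0.0)], toy_sample, [toy_offset, toy_offset])
print(history[-1])  # [(0.75, 0.75, 0.75)]
```

In this toy run the sampling position moves from the fixed starting point toward the object centre with each layer. In a real detector the offsets would come from learned layers and the features from the reconstructed 3D volume.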

📝 Abstract

In indoor scenes, the diverse distribution of object locations and scales makes visual 3D perception a major challenge. Previous works (e.g., NeRF-Det) have demonstrated that implicit representation can benefit visual 3D perception in indoor scenes where the input images overlap heavily. However, these works cannot fully exploit the advances in implicit representation because they rely on fixed sampling and simple multi-view feature fusion. In this paper, inspired by sparse-fashion methods (e.g., DETR3D), we propose a simple yet effective method, NeRF-DetS, to address the above issues. NeRF-DetS includes two modules: Progressive Adaptive Sampling Strategy (PASS) and Depth-Guided Simplified Multi-Head Attention Fusion (DS-MHA). Specifically, (1) PASS automatically samples features for each layer within a dense 3D detector, using offsets predicted by the previous layer. (2) DS-MHA not only fuses multi-view features efficiently with strong occlusion awareness but also reduces computational cost. Extensive experiments on the ScanNetV2 dataset demonstrate that NeRF-DetS outperforms NeRF-Det, achieving +5.02% and +5.92% improvements in mAP under IoU25 and IoU50, respectively. NeRF-DetS also shows consistent improvements on ARKitScenes.
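To make the idea of occlusion-aware multi-view fusion concrete, here is a minimal sketch. It is not DS-MHA as defined in the paper, which this page does not specify. It is a plain depth-weighted average, under the assumption that a view should contribute less when the 3D point lies behind the first surface seen along that view's ray. The function name `depth_guided_fusion`, the `sigma` tolerance, and the specific weighting formula are all hypothetical.

```python
import math

def depth_guided_fusion(view_feats, point_depths, rendered_depths, sigma=0.1):
    """Fuse per-view feature vectors for one 3D point (illustrative sketch only).

    view_feats:      list of V feature vectors (each a list of floats)
    point_depths:    depth of the 3D point in each camera
    rendered_depths: depth of the first surface along the same ray in each
                     camera, e.g. rendered from an implicit representation
    sigma:           hypothetical tolerance; smaller values penalise
                     occluded views more sharply
    Returns (fused_feature, per_view_weights).
    """
    # A point far behind the rendered surface is likely occluded in that view,
    # so its logit drops in proportion to the depth gap (assumed formula).
    logits = [-max(0.0, z - d) / sigma for z, d in zip(point_depths, rendered_depths)]
    # Numerically stable softmax over views.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the per-view features, one channel at a time.
    dim = len(view_feats[0])
    fused = [sum(w * f[i] for w, f in zip(weights, view_feats)) for i in range(dim)]
    return fused, weights
```

As a usage example, calling `depth_guided_fusion([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0], [2.0, 1.0])` describes a point that sits on the visible surface in the first view and two units behind it in the second. The second view receives a weight close to zero, so the fused feature is almost entirely the first view's feature.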

Problem

Research questions and friction points this paper is trying to address.

3D object detection

indoor environment

variability in position and size

Innovation

Methods, ideas, or system contributions that make the work stand out.

PASS (Progressive Adaptive Sampling Strategy)

DS-MHA (Depth-Guided Simplified Multi-Head Attention Fusion)

Multi-view 3D Object Detection
