OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for Multi-Camera 3D Object Detection

📅 2023-01-13

🏛️ arXiv.org

📈 Citations: 9

✨ Influential: 0

career value

221K/year

🤖 AI Summary

To address feature distortion and boundary ambiguity in multi-camera BEV 3D detection—stemming from the ill-posed image-to-3D mapping—this paper proposes an object-aware pseudo-3D–depth joint modeling framework. Our method introduces two key innovations: (1) a novel object-level depth supervision scheme coupled with pseudo-voxel encoding, enhancing structural consistency in depth estimation; and (2) a 2D detection-guided foreground pixel projection mechanism integrated with deformable attention fusion, enabling precise spatial alignment and contextual enhancement. These components collectively improve object structural representation and spatial localization accuracy in the BEV feature space. Evaluated on nuScenes, our approach achieves 72.5% mAP and 78.3% NDS, outperforming state-of-the-art baselines including BEVDet and BEVFormer.

📝 Abstract

The recent trend for multi-camera 3D object detection is through the unified bird's-eye view (BEV) representation. However, directly transforming features extracted from the image-plane view to BEV inevitably results in feature distortion, especially around the objects of interest, making the objects blur into the background. To this end, we propose OA-BEV, a network that can be plugged into the BEV-based 3D object detection framework to bring out the objects by incorporating object-aware pseudo-3D features and depth features. Such features contain information about the object's position and 3D structures. First, we explicitly guide the network to learn the depth distribution by object-level supervision from each 3D object's center. Then, we select the foreground pixels by a 2D object detector and project them into 3D space for pseudo-voxel feature encoding. Finally, the object-aware depth features and pseudo-voxel features are incorporated into the BEV representation with a deformable attention mechanism. We conduct extensive experiments on the nuScenes dataset to validate the merits of our proposed OA-BEV. Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score. Our codes will be published.

Problem

Research questions and friction points this paper is trying to address.

Addresses feature clutter in multi-camera 3D object detection

Improves object-background differentiation in BEV representations

Enhances 3D detection accuracy using object-aware depth features

Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric depth learning with bounding box supervision

Foreground pseudo points projection for voxel encoding

Deformable attention integration into BEV representation

🔎 Similar Papers

DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection