🤖 AI Summary
In outdoor 3D perception, inaccurate depth estimation causes misalignment between image and bird's-eye-view (BEV) features during cross-modal BEV projection. To address this, we propose RaCFormer, a radar-camera fusion framework for adaptive BEV construction. Our query-based method adaptively samples instance-relevant features from both the BEV and the original image view. Key contributions include: (1) a learnable ring-shaped query initialization in polar coordinates that adjusts query density with distance; (2) a radar-guided depth head that improves the robustness of the image-to-BEV transformation; and (3) an implicit dynamic catcher that exploits the radar Doppler effect to model temporal dynamics in BEV space. Evaluated on nuScenes, our approach achieves 64.9% mAP and 70.2% NDS, outperforming several LiDAR-based detectors, and ranks first on the VoD leaderboard.
📝 Abstract
We propose the Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection, motivated by the following insight: radar-camera fusion in outdoor 3D scene perception is capped by the image-to-BEV transformation, because if pixel depths are not accurately estimated, naively combining BEV features integrates misaligned visual content. To avoid this problem, we propose a query-based framework that adaptively samples instance-relevant features from both the BEV and the original image view. Furthermore, we enhance system performance through two key designs: optimizing query initialization and strengthening the representational capacity of the BEV. For the former, we introduce an adaptive circular distribution in polar coordinates to refine the initialization of object queries, allowing a distance-based adjustment of query density. For the latter, we first incorporate a radar-guided depth head to refine the transformation from image view to BEV. We then leverage the Doppler effect of radar and introduce an implicit dynamic catcher to capture temporal elements within the BEV. Extensive experiments on the nuScenes and View-of-Delft (VoD) datasets validate the merits of our design. Remarkably, our method achieves superior results of 64.9% mAP and 70.2% NDS on nuScenes, even outperforming several LiDAR-based detectors. RaCFormer also secures the 1st rank on the VoD leaderboard. The code will be released.
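To make the polar query initialization concrete, the sketch below places query reference points on concentric rings in BEV space, with fewer queries on farther rings to approximate a distance-based density adjustment. This is a minimal illustration only: the ring count, radii, and decay factor are hypothetical, not the paper's actual hyperparameters, and the real method learns/refines these positions rather than fixing them.

```python
import math

def init_polar_queries(num_rings=6, base_per_ring=16,
                       r_min=2.0, r_max=50.0, density_decay=0.8):
    """Sketch of ring-shaped query initialization in polar coordinates.

    Returns a list of (x, y) BEV reference points. Rings are linearly
    spaced between r_min and r_max; farther rings receive fewer queries
    (density_decay < 1), mimicking distance-based query density.
    All hyperparameters here are illustrative assumptions.
    """
    points = []
    for i in range(num_rings):
        # Linearly spaced ring radius for ring i.
        r = r_min + (r_max - r_min) * i / max(num_rings - 1, 1)
        # Fewer queries on distant rings, with a small floor.
        n = max(4, round(base_per_ring * density_decay ** i))
        for k in range(n):
            theta = 2 * math.pi * k / n
            points.append((r * math.cos(theta), r * math.sin(theta)))
    return points
```

In practice these points would seed learnable object queries that a Transformer decoder then refines by sampling image-view and BEV features.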