🤖 AI Summary
Existing camera-radar fusion methods for 3D object detection fail to fully exploit the radar point cloud's advantages in range and radial-velocity estimation, and their complex multimodal fusion architectures incur high computational overhead and poor real-time performance. This paper proposes a lightweight camera-radar fusion network that leverages pillar attention mechanisms for real-time bird's-eye-view (BEV) detection. First, a radar pillar feature embedding module explicitly encodes range and radial-velocity information. Second, an intra-pillar self-attention mechanism models geometric and kinematic dependencies among the points within each pillar. Third, a simplified convolutional fusion module replaces the conventional Feature Pyramid Network (FPN) to reduce feature-aggregation complexity. Evaluated on nuScenes, the method achieves a new state-of-the-art 58.2 NDS at 42 FPS inference speed, the fastest among comparable approaches, while improving robustness and efficiency under adverse environmental conditions.
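The pillar embedding and intra-pillar self-attention steps above can be sketched in PyTorch. This is a minimal illustration under stated assumptions: the per-point feature layout (x, y, z, range, radial velocity), the embedding width, and the head count are hypothetical placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class RadarPillarSelfAttention(nn.Module):
    """Sketch of an intra-pillar self-attention backbone.

    Each pillar holds up to P radar points; assumed per-point features
    are (x, y, z, range, radial velocity). Layer sizes are illustrative.
    """

    def __init__(self, in_dim=5, embed_dim=64, num_heads=4):
        super().__init__()
        # Embed raw per-point radar features (incl. range and velocity).
        self.embed = nn.Linear(in_dim, embed_dim)
        # Model geometric/kinematic dependencies among points in a pillar.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True)

    def forward(self, pillars):
        # pillars: (num_pillars, points_per_pillar, in_dim)
        x = self.embed(pillars)        # (N, P, embed_dim)
        x, _ = self.attn(x, x, x)      # intra-pillar self-attention
        # Max-pool over points -> one feature vector per pillar.
        return x.max(dim=1).values     # (N, embed_dim)

# Toy usage: 100 pillars, 20 radar points each, 5 features per point.
backbone = RadarPillarSelfAttention()
feats = backbone(torch.randn(100, 20, 5))
print(feats.shape)  # torch.Size([100, 64])
```

The max-pool at the end mirrors the PointPillars convention of collapsing each pillar's point set into a single feature vector before scattering it onto the BEV grid.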
📝 Abstract
Camera-radar fusion offers a robust, low-cost alternative to camera-lidar fusion for real-time 3D object detection under adverse weather and lighting conditions. However, the literature contains few works focusing on this modality and, most importantly, few new architectures that exploit the advantages of the radar point cloud, such as accurate distance estimation and velocity information. This work therefore presents a novel and efficient 3D object detection algorithm using cameras and radars in the bird's-eye view (BEV). Our algorithm exploits the advantages of radar before fusing the features into a detection head. A new backbone maps the radar pillar features into an embedding dimension, and a self-attention mechanism allows the backbone to model the dependencies between the radar points. We replace the FPN-based convolutional layers used in PointPillars-based architectures with a simplified convolutional layer, with the main goal of reducing inference time. Our results show that with this modification our approach achieves a new state of the art for 3D object detection, reaching 58.2 NDS with a ResNet-50 backbone, while also setting a new benchmark for inference time on the nuScenes dataset for the same category.
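The FPN replacement described in the abstract can be sketched as a single convolutional block that processes one BEV feature map, instead of an FPN-style neck that downsamples, upsamples, and concatenates several scales. All channel sizes below are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class SimpleBEVNeck(nn.Module):
    """Sketch of a simplified convolutional neck replacing an FPN.

    A PointPillars-style FPN builds and merges three scales; here a
    single conv-BN-ReLU block keeps one scale, trading multi-scale
    context for lower inference latency. Channel sizes are assumptions.
    """

    def __init__(self, in_ch=64, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, bev):
        # bev: (batch, channels, H, W) pseudo-image from the pillar scatter
        return self.block(bev)

# Toy usage: batch of 2 BEV maps, 64 channels, 128x128 grid.
neck = SimpleBEVNeck()
out = neck(torch.randn(2, 64, 128, 128))
print(out.shape)  # torch.Size([2, 128, 128, 128])
```

Because the block preserves spatial resolution, the detection head can consume its output directly, avoiding the upsample-and-concatenate stages that dominate an FPN neck's cost.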