🤖 AI Summary
Vision-based BEV 3D object detection often misses objects whose appearance has low contrast with the background. To address this, the paper proposes a Region-Oriented Attention (ROA) mechanism. ROA is described as the first approach to incorporate coarse-grained 2D detection priors into BEV feature learning, explicitly guiding the backbone network to focus on regions that may contain objects. It combines multi-scale features, large-kernel convolutions, and region-adaptive weighting to improve sensitivity to small objects while enlarging the receptive field for large objects. Implemented on top of the BEVDet and BEVDepth frameworks, ROA requires no additional annotations or complex architectural modifications. On the nuScenes benchmark, it achieves absolute improvements of +3.2% in mAP and +2.1% in NDS over the respective BEVDet and BEVDepth baselines. These results suggest that region-guided attention makes BEV representations more robust and more discriminative.
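The idea of guiding features with coarse 2D region priors can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `region_mask` and `reweight`, the end-exclusive box format `(x0, y0, x1, y1)` in feature-map cells, and the weighting rule `feature * (1 + alpha * mask)` are all assumptions chosen for illustration.

```python
# Illustrative sketch only: names, box format, and weighting rule are assumptions,
# not the ROA module described in the paper.

def region_mask(height, width, boxes):
    """Rasterize 2D boxes (x0, y0, x1, y1), end-exclusive, into a 0/1 mask."""
    mask = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the feature-map bounds before filling it.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1.0
    return mask


def reweight(features, mask, alpha=1.0):
    """Scale each cell by (1 + alpha * mask); cells outside all boxes are unchanged."""
    return [
        [f * (1.0 + alpha * m) for f, m in zip(feat_row, mask_row)]
        for feat_row, mask_row in zip(features, mask)
    ]


# A 3x4 single-channel feature map with one coarse box covering columns 1-2, rows 0-1.
features = [[1.0] * 4 for _ in range(3)]
mask = region_mask(3, 4, [(1, 0, 3, 2)])
weighted = reweight(features, mask, alpha=0.5)
```

In this sketch, cells inside the box are amplified by a factor of 1.5 and the rest keep their original value, so background features are de-emphasized relative to candidate object regions rather than discarded.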
📝 Abstract
Vision-based BEV (Bird's-Eye-View) 3D object detection has recently become popular in autonomous driving. However, existing methods struggle to detect objects that closely resemble the background from the camera perspective. In this paper, we propose 2D Region-oriented Attention for a BEV-based 3D Object Detection Network (ROA-BEV), which makes the backbone focus its feature learning on areas where objects may exist. Moreover, our method increases the information content of ROA through a multi-scale structure. In addition, every block of ROA uses a large kernel so that the receptive field is large enough to capture information about large objects. Experiments on nuScenes show that ROA-BEV improves on the performance of both BEVDet and BEVDepth. The code will be released soon.
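The link between kernel size and receptive field can be made concrete with the standard receptive-field recurrence for stacked convolutions. The sketch below is a hedged illustration: the helper name `receptive_field` and the example kernel sizes (3 and 7) are assumptions, since the abstract does not state the kernel size ROA actually uses.

```python
# Illustrative sketch: kernel sizes below are assumed, not taken from the paper.

def receptive_field(layers):
    """Return the receptive field (in input pixels) of stacked conv layers.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1  # jump = spacing between adjacent outputs, in input pixels
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


small = receptive_field([(3, 1)] * 3)   # three stacked 3x3 convolutions
large = receptive_field([(7, 1)] * 3)   # three stacked 7x7 convolutions
strided = receptive_field([(7, 2), (7, 2)])  # two 7x7 convolutions with stride 2
```

With the same depth, the larger kernels cover 19 input pixels per output location instead of 7, which is the sense in which a large kernel helps a block see an entire large object at once.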