🤖 AI Summary
This work addresses the challenge of unifying 2D perspective-view and 3D bird’s-eye-view (BEV) perception in multi-camera autonomous driving by proposing SimPB++, an end-to-end model with a hybrid decoder architecture. The method establishes a cyclic 3D-2D-3D refinement loop through Dynamic Query Allocation and Adaptive Query Aggregation, and uses Query-group Attention for multi-view 2D detection. It further incorporates a Crop-and-Scale strategy for long-range perception and a Propagating Denoising strategy, and it supports mixed supervision from both 2D-only and fully annotated data, reducing reliance on costly 3D labels. Evaluated on nuScenes, SimPB++ achieves state-of-the-art performance in both 2D and 3D object detection, and it demonstrates strong long-range detection (up to 150 m) on Argoverse2.
📝 Abstract
Simultaneous perception of 2D objects in perspective view and 3D objects in Bird's Eye View (BEV) is challenging for multi-camera autonomous driving. Existing two-stage pipelines use 2D results only as a one-time cue for 3D detection. We propose SimPB++, which simultaneously detects 2D objects in perspective view and 3D objects in BEV from multiple cameras. It unifies both tasks into an end-to-end model with a hybrid decoder architecture that couples multi-view 2D and 3D decoders interactively. Two novel modules enable deep interaction: Dynamic Query Allocation adaptively assigns 2D queries to 3D candidates, and Adaptive Query Aggregation refines 3D representations using multi-view 2D features, forming a cyclic 3D-2D-3D refinement. For multi-view 2D detection, we use Query-group Attention for intra-group communication. We also design a Crop-and-Scale strategy for long-range perception and a Propagating Denoising strategy with an auxiliary RoI detector. SimPB++ supports mixed supervision with 2D-only and fully annotated data, reducing reliance on expensive 3D labels. Experiments show state-of-the-art performance on nuScenes for both tasks and strong long-range detection (up to 150 m) on Argoverse2.
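The abstract does not spell out how Dynamic Query Allocation or Query-group Attention are computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that a 3D candidate receives one 2D query for every camera whose image contains the projection of its center, and that a "group" is the set of 2D queries belonging to the same camera. The function names (`project`, `allocate_2d_queries`, `group_attention_mask`), the 3x4 projection matrices, and the image size are all hypothetical.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]          # 3x4 camera projection matrix (assumed given)
Point = Tuple[float, float, float]  # 3D candidate center in a shared frame


def project(point: Point, cam: Matrix) -> Optional[Tuple[float, float]]:
    """Project a 3D point to pixel coordinates; None if it lies behind the camera."""
    x, y, z = point
    h = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in cam]
    depth = h[2]
    if depth <= 1e-6:
        return None
    return h[0] / depth, h[1] / depth


def allocate_2d_queries(
    centers: List[Point], cams: List[Matrix], image_size: Tuple[int, int]
) -> List[Tuple[int, int]]:
    """Return (candidate_index, camera_index) pairs, one per allocated 2D query.

    Assumption: a candidate gets a 2D query in every camera whose image
    contains its projected center, so the number of 2D queries varies
    with the scene instead of being fixed in advance.
    """
    width, height = image_size
    pairs = []
    for i, center in enumerate(centers):
        for j, cam in enumerate(cams):
            uv = project(center, cam)
            if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
                pairs.append((i, j))
    return pairs


def group_attention_mask(pairs: List[Tuple[int, int]]) -> List[List[bool]]:
    """Boolean mask: entry [a][b] is True if query a may attend to query b.

    Assumption: groups are cameras, so attention stays within one image.
    """
    return [[qa[1] == qb[1] for qb in pairs] for qa in pairs]


# Toy setup: a front camera looking along +z and a rear camera looking along -z,
# both with focal length 100 px and principal point (50, 50) on a 100x100 image.
FRONT = [[100.0, 0.0, 50.0, 0.0], [0.0, 100.0, 50.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
REAR = [[-100.0, 0.0, -50.0, 0.0], [0.0, 100.0, -50.0, 0.0], [0.0, 0.0, -1.0, 0.0]]

centers = [(0.0, 0.0, 10.0), (0.0, 0.0, -10.0), (4.0, 0.0, 10.0), (10.0, 0.0, 10.0)]
pairs = allocate_2d_queries(centers, [FRONT, REAR], (100, 100))
mask = group_attention_mask(pairs)
print(pairs)
print(mask)
```

In this toy scene the last candidate projects outside both images and so receives no 2D query, while the two candidates seen by the front camera end up in the same attention group. A real system would presumably also feed the refined 2D features back into the 3D candidates (the Adaptive Query Aggregation step), which this sketch leaves out.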