SimPB++: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras

📅 2026-05-03

🤖 AI Summary
This work addresses the challenge of unifying 2D perspective-view and 3D bird's-eye-view (BEV) perception in multi-camera autonomous driving. It proposes SimPB++, an end-to-end model with a hybrid decoder architecture that interactively couples multi-view 2D and 3D decoders. Dynamic Query Allocation assigns 2D queries to 3D candidates, and Adaptive Query Aggregation refines 3D representations with multi-view 2D features, forming a cyclic 3D-2D-3D refinement loop; Query-group Attention handles intra-group communication for multi-view 2D detection. A Crop-and-Scale strategy targets long-range perception, and a Propagating Denoising strategy uses an auxiliary RoI detector. The model supports mixed supervision from 2D-only and fully annotated data, reducing reliance on costly 3D labels. On nuScenes, SimPB++ reports state-of-the-art performance in both 2D and 3D object detection, and on Argoverse2 it shows strong long-range detection at up to 150 meters.
📝 Abstract
Simultaneous perception of 2D objects in perspective view and 3D objects in Bird's Eye View (BEV) is challenging for multi-camera autonomous driving. Existing two-stage pipelines use 2D results only as a one-time cue for 3D detection. We propose SimPB++, which simultaneously detects 2D objects in perspective and 3D objects in BEV from multiple cameras. It unifies both tasks into an end-to-end model with a hybrid decoder architecture, coupling multi-view 2D and 3D decoders interactively. Two novel modules enable deep interaction: Dynamic Query Allocation adaptively assigns 2D queries to 3D candidates, and Adaptive Query Aggregation refines 3D representations using multi-view 2D features, forming a cyclic 3D-2D-3D refinement. For multi-view 2D detection, we use Query-group Attention for intra-group communication. We also design a Crop-and-Scale strategy for long-range perception and a Propagating Denoising strategy with an auxiliary RoI detector. SimPB++ supports mixed supervision with 2D-only and fully annotated data, reducing reliance on expensive 3D labels. Experiments show state-of-the-art performance on nuScenes for both tasks and strong long-range detection (up to 150m) on Argoverse2.
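To make the 3D-2D-3D loop described above more concrete, here is a minimal sketch of the two interaction steps under strong simplifying assumptions. It is not the paper's implementation: the `Camera` class, the pinhole projection at the ego origin, the visibility-based allocation rule, and the fixed-weight averaging in `aggregate` are all hypothetical stand-ins for the learned Dynamic Query Allocation and Adaptive Query Aggregation modules.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    """Hypothetical pinhole camera at the ego origin, rotated by yaw about the up axis."""
    name: str
    yaw_deg: float
    focal: float = 400.0
    width: int = 1000
    height: int = 600


def project(point, cam, min_depth=0.1):
    """Project an ego-frame point (x forward, y left, z up) to pixels, or return None."""
    x, y, z = point
    yaw = math.radians(cam.yaw_deg)
    depth = x * math.cos(yaw) + y * math.sin(yaw)      # distance along the optical axis
    lateral = -x * math.sin(yaw) + y * math.cos(yaw)   # offset to the camera's left
    if depth < min_depth:
        return None  # behind the camera or too close to its image plane
    u = cam.width / 2 - cam.focal * lateral / depth
    v = cam.height / 2 - cam.focal * z / depth
    if 0 <= u < cam.width and 0 <= v < cam.height:
        return (u, v)
    return None


def allocate_queries(anchors, cameras):
    """Assumed allocation rule: one 2D query per (3D anchor, camera) pair where the anchor is visible."""
    allocation = {i: [] for i in range(len(anchors))}
    for i, anchor in enumerate(anchors):
        for c, cam in enumerate(cameras):
            if project(anchor, cam) is not None:
                allocation[i].append(c)
    return allocation


def aggregate(feats_3d, feats_2d, allocation, weight=0.5):
    """Assumed aggregation: blend each 3D feature with the mean of its allocated 2D features."""
    out = []
    for i, f3d in enumerate(feats_3d):
        cams = allocation[i]
        if not cams:
            out.append(list(f3d))  # no camera sees this anchor, so keep the 3D feature
            continue
        mean_2d = [
            sum(feats_2d[(i, c)][k] for c in cams) / len(cams)
            for k in range(len(f3d))
        ]
        out.append([(1 - weight) * a + weight * b for a, b in zip(f3d, mean_2d)])
    return out


cameras = [Camera("front", 0.0), Camera("left", 90.0)]
anchors = [(10.0, 0.0, 0.0), (10.0, 9.0, 0.0), (-10.0, 0.0, 0.0)]
allocation = allocate_queries(anchors, cameras)
print(allocation)
```

In this toy setup the number of 2D queries per 3D candidate varies with how many cameras see it: the second anchor falls in the overlap of both views and receives two queries, while the anchor behind the vehicle receives none. A real model would replace the fixed `weight` with learned, per-view weights and would repeat the allocate-decode-aggregate cycle across decoder layers.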
Problem

Research questions and friction points this paper is trying to address.

multi-camera perception
2D object detection
3D object detection
Bird's Eye View
autonomous driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-camera perception
2D-3D joint detection
Bird's Eye View (BEV)
dynamic query allocation
end-to-end unified model