Leveraging Previous-Traversal Point Cloud Map Priors for Camera-Based 3D Object Detection and Tracking

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses depth ambiguity in camera-based 3D object detection and tracking, which limits localization accuracy when no online LiDAR is available at inference. The authors propose DualViewMapDet, a camera-only inference framework that uses a static point cloud map built from previous traversals as a geometric prior. The method fuses camera and map information in both the perspective view (PV) and the bird’s-eye view (BEV) rather than relying on a one-sided view conversion. In PV, the map is projected into the image and encoded as multi-channel geometric cues that enrich image features and support BEV lifting; in BEV, the map is encoded with a sparse voxel backbone and fused with the lifted camera features in a shared metric space. Experiments on nuScenes and Argoverse 2 show consistent improvements over strong camera-only baselines, with particularly strong gains in object localization.
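
The perspective-view step can be illustrated with a minimal sketch of projecting map points into a camera image to obtain one geometric channel, a sparse depth image. This is not the authors' implementation: the function name `project_map_to_depth`, the pinhole camera model, and the choice of depth as the only channel are assumptions made for illustration, and the summary does not specify which channels DualViewMapDet actually encodes.

```python
import numpy as np


def project_map_to_depth(points_world, T_cam_from_world, K, height, width):
    """Rasterize 3D map points into a sparse per-pixel depth image.

    Hypothetical helper, not taken from the paper.
    points_world:      (N, 3) map points in the world frame.
    T_cam_from_world:  (4, 4) rigid transform from world to camera frame.
    K:                 (3, 3) pinhole intrinsics (assumed camera model).
    Returns a (height, width) array; 0 marks pixels with no map point.
    """
    # Move points into the camera frame using homogeneous coordinates.
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_cam_from_world @ homo.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]

    # Pinhole projection to integer pixel coordinates.
    uvw = (K @ pts_cam.T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]

    # Discard points that fall outside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # Z-buffer: keep the nearest depth when several points hit one pixel.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0
    return depth
```

In a sketch like this, the resulting depth image would be stacked with the RGB input or with intermediate image features; further channels (for example height above ground or point density) could be rasterized the same way.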

📝 Abstract

Camera-based 3D object detection and tracking are central to autonomous driving, yet precise 3D object localization remains fundamentally constrained by depth ambiguity when no expensive, depth-rich online LiDAR is available at inference. In many deployments, however, vehicles repeatedly traverse the same environments, making static point cloud maps from prior traversals a practical source of geometric priors. We propose DualViewMapDet, a camera-only inference framework that retrieves such map priors online and leverages them to mitigate the absence of a LiDAR sensor during deployment. The key idea is a dual-space camera-map fusion strategy that avoids one-sided view conversion. Specifically, we (i) project the map into perspective view (PV) and encode multi-channel geometric cues to enrich image features and support BEV lifting, and (ii) encode the map directly in bird's-eye view (BEV) with a sparse voxel backbone and fuse it with lifted camera features in a shared metric space. Extensive evaluations on nuScenes and Argoverse 2 demonstrate consistent improvements over strong camera-only baselines, with particularly strong gains in object localization. Ablations further validate the contributions of PV/BEV fusion and prior-map coverage. We make the code and pre-trained models available at https://dualviewmapdet.cs.uni-freiburg.de .
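
The BEV branch in step (ii) can be sketched in simplified form as follows. The abstract describes a sparse voxel backbone; the code below substitutes a dense two-channel rasterization (occupancy and maximum height) as a stand-in, and fuses by channel concatenation. Both choices, along with the names `rasterize_map_to_bev` and `fuse_bev` and the grid parameters, are assumptions for illustration rather than the authors' design.

```python
import numpy as np


def rasterize_map_to_bev(points_ego, x_range, y_range, cell_size):
    """Build a (2, H, W) BEV grid from map points in the ego frame.

    Hypothetical dense stand-in for a sparse voxel encoder.
    Channel 0 is occupancy, channel 1 is the maximum point height per cell.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    width = int(round((x_max - x_min) / cell_size))
    height = int(round((y_max - y_min) / cell_size))

    # Metric coordinates -> integer cell indices.
    col = np.floor((points_ego[:, 0] - x_min) / cell_size).astype(int)
    row = np.floor((points_ego[:, 1] - y_min) / cell_size).astype(int)
    inside = (col >= 0) & (col < width) & (row >= 0) & (row < height)
    col, row, z = col[inside], row[inside], points_ego[inside, 2]

    occupancy = np.zeros((height, width))
    occupancy[row, col] = 1.0

    # Unbuffered maximum so repeated cells keep their highest point.
    max_height = np.full((height, width), -np.inf)
    np.maximum.at(max_height, (row, col), z)
    max_height[occupancy == 0] = 0.0

    return np.stack([occupancy, max_height])


def fuse_bev(camera_bev, map_bev):
    """Concatenate camera and map BEV features along the channel axis.

    Both inputs must cover the same metric grid, i.e. share (H, W).
    Concatenation is an assumed fusion operator, chosen for simplicity.
    """
    if camera_bev.shape[1:] != map_bev.shape[1:]:
        raise ValueError("camera and map BEV grids must have the same size")
    return np.concatenate([camera_bev, map_bev], axis=0)
```

Because both grids are indexed in metres around the ego vehicle, each cell of the lifted camera features lines up with the same cell of the map encoding, which is what fusing "in a shared metric space" amounts to in this simplified setting.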

Problem

Research questions and friction points this paper is trying to address.

3D object detection

depth ambiguity

camera-only

object localization

autonomous driving

Innovation

Methods, ideas, or system contributions that make the work stand out.

camera-only 3D detection

map priors

dual-view fusion