Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing camera-based 3D detection and tracking methods are largely confined to either perspective view (PV) or bird’s-eye view (BEV), struggling to jointly capture fine-grained appearance cues and global spatial structure. To address this, we propose DualViewDistill—a novel framework that, for the first time, distills semantic-geometric features from the DINOv2 foundation model into BEV space and enables synergistic optimization of PV-detail and BEV-structured representations via deformable cross-view aggregation. Our key contributions are: (1) foundation-model-guided BEV semantic enhancement, and (2) a dual-view feature complementary distillation paradigm. Extensive experiments on nuScenes and Argoverse 2 demonstrate state-of-the-art performance in both 3D detection and multi-object tracking, with significant accuracy gains. These results validate the effectiveness and generalizability of foundation-model-driven BEV representation learning.

📝 Abstract
Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird's-eye-view (BEV) features, limiting their ability to exploit both fine-grained object details and spatially structured scene representations. In this work, we propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features to exploit their complementary strengths. Our approach introduces BEV maps guided by foundation models, distilling descriptive DINOv2 features into BEV representations through a novel distillation process. By integrating PV features with BEV maps enriched with semantic and geometric DINOv2 features via deformable aggregation, our model uses this hybrid representation to enhance 3D object detection and tracking. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that DualViewDistill achieves state-of-the-art performance. The results showcase the potential of foundation model BEV maps to enable more reliable perception for autonomous driving. We make the code and pre-trained models available at https://dualviewdistill.cs.uni-freiburg.de.
Problem

Research questions and friction points this paper is trying to address.

Combining perspective-view and BEV features for autonomous driving perception
Distilling foundation model features into BEV representations
Enhancing 3D object detection and tracking with hybrid representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Foundation-model-guided BEV map creation
Distills DINOv2 features into BEV representations
Integrates PV and BEV features via deformable aggregation
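The page does not include implementation details of the distillation step, but objectives of this kind are commonly formulated as a per-cell similarity loss between the model's BEV features (student) and foundation-model features projected into BEV space (teacher). A minimal sketch, assuming a cosine-similarity formulation and the hypothetical name `cosine_distill_loss` (not the paper's actual loss):

```python
import numpy as np

def cosine_distill_loss(student_bev, teacher_bev, eps=1e-8):
    """Per-cell cosine-similarity distillation loss.

    student_bev: BEV features predicted by the detector, shape (H, W, C).
    teacher_bev: DINOv2 features projected into the BEV grid, shape (H, W, C).
    Returns the mean of (1 - cosine similarity) over all BEV cells,
    which is 0 when student and teacher features point in the same direction.
    """
    s = student_bev / (np.linalg.norm(student_bev, axis=-1, keepdims=True) + eps)
    t = teacher_bev / (np.linalg.norm(teacher_bev, axis=-1, keepdims=True) + eps)
    cos = np.sum(s * t, axis=-1)      # (H, W) cosine similarity per BEV cell
    return float(np.mean(1.0 - cos))

# Toy usage: identical student and teacher features give (near) zero loss.
H, W, C = 4, 4, 16
teacher = np.random.default_rng(0).standard_normal((H, W, C))
loss_same = cosine_distill_loss(teacher, teacher)
```

Only the feature-matching objective is sketched here; how the teacher features are lifted from the camera images into the BEV grid (the paper's deformable aggregation) is a separate component.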
Markus Käppeler
Department of Computer Science, University of Freiburg, Germany

Özgün Çiçek
Bosch Research, Robert Bosch GmbH, Renningen, Germany

Daniele Cattaneo
Department of Computer Science, University of Freiburg, Germany

Claudius Gläser
Robert Bosch GmbH
Automated Driving, Perception, Machine Learning

Yakov Miron
Senior Research Manager and Scientist, Bosch-AI
AI, Machine Learning, Automated Driving

Abhinav Valada
Professor & Director of Robot Learning Lab, University of Freiburg
Robotics, Machine Learning, Computer Vision, Artificial Intelligence