Learning Ego-Centric BEV Representations from a Perspective-Privileged View: Cross-View Supervision for Online HD Map Construction

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multi-camera bird's-eye-view (BEV) methods rely on ego-centric supervision and struggle to reconstruct large-scale map structure reliably under occlusion, sparse observations, and perspective distortion. To address this, the authors propose Cross-View Supervision (CVS), a training paradigm that uses an ego-aligned overhead view as a perspective-privileged teacher. Feature-level knowledge distillation injects geometric and topological priors into the ego-centric BEV encoder by aligning teacher and student representations in a shared BEV feature space, which improves structural consistency. The approach requires no changes to the inference architecture and no additional sensors or overhead input at test time. On the nuScenes benchmark, the method outperforms StreamMapNet by 3.9 mAP in the 60×30 m region and by 9.9 mAP in the 100×50 m region, a 44% relative improvement at long range.

📝 Abstract

Bird's-eye-view (BEV) representations derived from multi-camera input have become a central interface for online high-definition (HD) map construction. However, most approaches rely solely on ego-centric supervision, requiring large-scale scene structure to be inferred from incomplete observations, occlusions, and diminishing information density at long range, where perspective effects and spatial sparsity hinder consistent structural reasoning. We introduce Cross-View Supervision (CVS), a representation learning paradigm that transfers geometric and topological priors from an ego-aligned overhead perspective into camera-based BEV encoders. Rather than adding auxiliary semantic losses, CVS aligns representations in a shared BEV feature space and distills globally consistent structural knowledge from a perspective-privileged teacher into the ego-centric backbone. This supervision enhances structural coherence without modifying the inference architecture or requiring overhead input at test time. Experiments on nuScenes using ego-aligned aerial imagery from the AID4AD cross-view extension demonstrate consistent improvements over StreamMapNet while maintaining identical camera-only inference. CVS yields +3.9 mAP in the standard 60×30 m region and +9.9 mAP in the extended 100×50 m setting, corresponding to a 44% relative gain at long range. These results highlight perspective-privileged structural supervision as a promising training principle for improving BEV representation learning in HD map construction.
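
To make the idea of feature-level distillation in a shared BEV feature space more concrete, here is a minimal sketch of one possible alignment loss. It is not the authors' implementation: the abstract does not specify the loss form, so the channel-wise L2 normalization, the squared-error distance, the optional validity mask, and the names `feature_distillation_loss`, `student_bev`, `teacher_bev`, and `valid_mask` are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=0, eps=1e-8):
    """Scale vectors along `axis` to unit length (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def feature_distillation_loss(student_bev, teacher_bev, valid_mask=None):
    """Hypothetical feature-alignment loss between two BEV feature maps.

    student_bev: (C, H, W) features from the camera-based (ego-centric) encoder.
    teacher_bev: (C, H, W) features from the overhead-view teacher, assumed to
                 be already resampled onto the same ego-aligned BEV grid.
    valid_mask:  optional (H, W) boolean array marking cells where the teacher
                 signal is trusted (an assumption, not taken from the paper).
    """
    if student_bev.shape != teacher_bev.shape:
        raise ValueError("student and teacher must share one BEV feature space")

    # Compare feature directions only, so channel magnitude does not dominate.
    s = l2_normalize(student_bev, axis=0)
    t = l2_normalize(teacher_bev, axis=0)

    # Squared distance per BEV cell, summed over channels -> shape (H, W).
    per_cell = np.sum((s - t) ** 2, axis=0)

    if valid_mask is None:
        return float(per_cell.mean())

    weights = valid_mask.astype(per_cell.dtype)
    # Average over trusted cells; an empty mask yields a loss of 0.
    return float((per_cell * weights).sum() / max(weights.sum(), 1.0))
```

In a training setup along these lines, the term would typically be added to the usual map-construction loss with a weighting factor, with the teacher kept fixed so that gradients flow only into the camera-based encoder. At inference the teacher and the loss are dropped entirely, which is consistent with the camera-only inference described in the abstract.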

Problem

Research questions and friction points this paper is trying to address.

BEV representation

online HD map construction

ego-centric supervision

structural reasoning

perspective-privileged view

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-View Supervision

BEV representation learning

HD map construction