🤖 AI Summary
This work addresses a limitation of existing end-to-end autonomous driving systems: they rely heavily on sensor inputs and struggle to fuse heterogeneous map priors (vector maps, raster maps, and satellite imagery) whose availability is inconsistent and whose poses drift at test time. The authors propose a Unified Map Prior Encoder (UMPE) with a dual-branch architecture for geometry-aware alignment and fusion: the vector branch applies SE(2) pre-alignment followed by confidence-weighted cross-attention, while the raster branch combines a FiLM-conditioned ResNet-18 with zero-initialized residual fusion. UMPE is presented as the first framework to encode arbitrary combinations of map priors in a unified way, and it demonstrates powerset robustness: a model trained with all priors outperforms single-prior models even when only one prior is available at test time. On nuScenes, UMPE improves mAP by +5.9 for MapTRv2 and +5.3 for MapQR; on Argoverse2 it adds +4.1 mAP. In end-to-end planning with VAD on nuScenes, it reduces average L2 trajectory error by 0.30 m (0.72 to 0.42 m) and collision rate by 0.10 percentage points (0.22% to 0.12%).
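The raster-branch ideas named above (FiLM conditioning and zero-initialized residual fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `film`, the class `ZeroInitResidualFusion`, the single linear projection, and the three-channel toy vectors are all hypothetical. How the FiLM parameters are produced is not stated here, so `gamma` and `beta` are passed in directly.

```python
# Hypothetical sketch of FiLM modulation plus zero-initialized residual fusion.
# Names, shapes, and the linear projection are illustrative assumptions.


def film(features, gamma, beta):
    """Apply a per-channel scale (gamma) and shift (beta) to channel values."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


class ZeroInitResidualFusion:
    """Add a linear projection of prior features to BEV features.

    The projection weights start at zero, so the fused output initially
    equals the BEV input (a do-no-harm starting point). Training would
    then move the weights away from zero where the prior helps.
    """

    def __init__(self, channels):
        self.weight = [[0.0] * channels for _ in range(channels)]

    def __call__(self, bev, prior):
        projected = [sum(w * p for w, p in zip(row, prior)) for row in self.weight]
        return [b + d for b, d in zip(bev, projected)]


bev = [0.5, -1.0, 2.0]
prior = film([1.0, 2.0, 3.0], gamma=[2.0, 0.5, 1.0], beta=[0.0, 1.0, -1.0])
fusion = ZeroInitResidualFusion(channels=3)
fused_at_init = fusion(bev, prior)  # identical to bev while weights are zero
```

Because the residual path contributes nothing at initialization, adding the prior branch to a pretrained sensor-only model would not change its outputs until the projection weights are updated.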
📝 Abstract
Online mapping and end-to-end (E2E) planning in autonomous driving remain largely sensor-centric, leaving rich map priors, including HD/SD vector maps, rasterized SD maps, and satellite imagery, underused because of heterogeneity, pose drift, and inconsistent availability at test time. We present UMPE, a Unified Map Prior Encoder that can ingest any subset of four priors and fuse them with BEV features for both mapping and planning. UMPE has two branches. The vector encoder pre-aligns HD/SD polylines with a frame-wise SE(2) correction, encodes points via multi-frequency sinusoidal features, and produces polyline tokens with confidence scores. BEV queries then apply cross-attention with confidence bias, followed by normalized channel-wise gating to avoid length imbalance and softly down-weight uncertain sources. The raster encoder shares a ResNet-18 backbone conditioned by FiLM with scaling and shift at every stage, performs SE(2) micro-alignment, and injects priors through zero-initialized residual fusion, so the network starts from a do-no-harm baseline and learns to add only useful prior evidence. A vector-then-raster fusion order reflects the inductive bias of geometry first, appearance second. On nuScenes mapping, UMPE lifts MapTRv2 from 61.5 to 67.4 mAP (+5.9) and MapQR from 66.4 to 71.7 mAP (+5.3). On Argoverse2, UMPE adds +4.1 mAP over strong baselines. UMPE is compositional: when trained with all priors, it outperforms single-prior models even when only one prior is available at test time, demonstrating powerset robustness. For E2E planning with the VAD backbone on nuScenes, UMPE reduces trajectory error from 0.72 to 0.42 m L2 on average (-0.30 m) and collision rate from 0.22% to 0.12% (-0.10%), surpassing recent prior-injection methods. These results show that a unified, alignment-aware treatment of heterogeneous map priors yields better mapping and better planning.
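The vector-branch steps in the abstract (SE(2) pre-alignment, multi-frequency sinusoidal point features, and cross-attention with a confidence bias) can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the function names, the power-of-two frequency schedule, and the choice to add log-confidence to the attention logits are all hypothetical, since the abstract does not specify these details.

```python
import math


def se2_align(points, dx, dy, theta):
    """Rotate 2D points by theta (radians), then translate by (dx, dy)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]


def sinusoidal_features(point, num_freqs=4):
    """Encode (x, y) with sin/cos at frequencies 2^0 ... 2^(num_freqs - 1).

    The frequency schedule is an assumption made for this sketch.
    """
    feats = []
    for k in range(num_freqs):
        freq = 2.0 ** k
        for coord in point:
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats


def confidence_biased_attention(query, keys, values, confidences, eps=1e-6):
    """Scaled dot-product attention with log(confidence) added to each logit.

    Using log-confidence as the bias is an assumption; it makes low-confidence
    tokens receive proportionally less attention weight.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) + math.log(conf + eps)
        for key, conf in zip(keys, confidences)
    ]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Toy usage: align a two-point polyline, encode it, and attend with a zero query.
polyline = se2_align([(1.0, 0.0), (2.0, 0.0)], dx=0.5, dy=0.0, theta=math.pi / 2)
tokens = [sinusoidal_features(p, num_freqs=2) for p in polyline]
query = [0.0] * len(tokens[0])
_, weights = confidence_biased_attention(query, tokens, tokens, confidences=[0.9, 0.1])
```

With a zero query the content term vanishes, so the attention weights are driven by the confidences alone; this isolates the soft down-weighting of uncertain sources that the abstract describes.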