🤖 AI Summary
To address computational redundancy in online high-definition (HD) map construction caused by reliance on outdated offline maps and multi-modal sensors, this paper proposes MapKD, a multi-level cross-modal knowledge distillation framework. MapKD introduces a novel teacher–coach–student paradigm: a camera-LiDAR fusion model (incorporating prior map information) serves as the teacher; a lightweight vision-only model acts as the student; and a LiDAR-simulating coach model bridges the modality gap. To enable efficient feature alignment and semantic consistency, MapKD integrates token-guided 2D patch distillation (TGPD) and masked semantic response distillation (MSRD). Evaluated on nuScenes, the distilled student model achieves gains of +6.68 mIoU and +10.94 mAP over its baseline, while significantly accelerating inference. This demonstrates a synergistic optimization of accuracy and efficiency for real-time HD map generation.
📝 Abstract
Online HD map construction is a fundamental task in autonomous driving systems, aiming to acquire semantic information about map elements around the ego vehicle from real-time sensor inputs. Recently, several approaches have achieved promising results by incorporating offline priors such as SD maps and HD maps or by fusing multi-modal data. However, these methods depend on stale offline maps and multi-modal sensor suites, resulting in avoidable computational overhead at inference. To address these limitations, we employ a knowledge distillation strategy to transfer knowledge from multi-modal models with prior knowledge to an efficient, low-cost, and vision-centric student model. Specifically, we propose MapKD, a novel multi-level cross-modal knowledge distillation framework with an innovative Teacher-Coach-Student (TCS) paradigm. This framework consists of: (1) a camera-LiDAR fusion model with SD/HD map priors serving as the teacher; (2) a vision-centric coach model with prior knowledge and simulated LiDAR to bridge the cross-modal knowledge transfer gap; and (3) a lightweight vision-based student model. Additionally, we introduce two targeted knowledge distillation strategies: Token-Guided 2D Patch Distillation (TGPD) for bird's-eye-view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance. Extensive experiments on the challenging nuScenes dataset demonstrate that MapKD improves the student model by +6.68 mIoU and +10.94 mAP while simultaneously accelerating inference speed. The code is available at: https://github.com/2004yan/MapKD2026.
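The abstract does not give formulas for the two distillation losses, but the general idea of token-guided patch distillation — aligning the student's BEV features with the teacher's only on the patches the teacher deems important — can be sketched as follows. This is a minimal, illustrative NumPy sketch under assumed shapes; the function name, the top-k selection rule, and the plain MSE objective are assumptions, not the paper's actual TGPD formulation.

```python
import numpy as np

def token_guided_patch_distill(teacher_feats, student_feats, token_scores, top_k):
    """Illustrative sketch of token-guided patch distillation.

    teacher_feats, student_feats: (num_patches, dim) BEV patch features.
    token_scores: (num_patches,) importance score per patch (assumed to
    come from the teacher's tokens). The loss is computed only on the
    top_k most important patches, focusing alignment where it matters.
    """
    idx = np.argsort(token_scores)[::-1][:top_k]   # indices of top-k patches
    diff = teacher_feats[idx] - student_feats[idx]
    return float(np.mean(diff ** 2))               # MSE on selected patches only
```

In a training loop this scalar would be added, with some weight, to the student's task loss; the masked-response idea behind MSRD is analogous but operates on semantic output maps rather than intermediate features.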