🤖 AI Summary
Pure-vision bird’s-eye-view (BEV) map segmentation lags behind LiDAR-camera (LC) fusion models, and existing knowledge distillation (KD) methods often enhance student performance by enlarging the student architecture—sacrificing inference efficiency.
Method: We propose a lightweight Teacher Assistant (TA) network that bridges the representational gap between an LC teacher and a pure-vision student without modifying the student’s architecture or increasing its inference cost. The TA establishes a shared latent space and introduces a theoretically grounded dual-path distillation loss derived from Young’s inequality, enabling stable and efficient joint transfer of feature-level and output-level knowledge.
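The dual-path loss described above can be illustrated with a minimal NumPy sketch. The function name, the squared-error form, the weighting constant `eps`, and the feature arrays are our own illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def dual_path_distill_loss(f_teacher, f_ta, f_student, eps=1.0):
    """Hypothetical dual-path distillation loss.

    By Young's inequality, (1+eps)*||f_T - f_A||^2 + (1+1/eps)*||f_A - f_S||^2
    upper-bounds the direct discrepancy ||f_T - f_S||^2, so minimizing the
    teacher-TA and TA-student terms also tightens the teacher-student gap.
    """
    path_teacher_ta = np.sum((f_teacher - f_ta) ** 2)   # teacher -> TA path
    path_ta_student = np.sum((f_ta - f_student) ** 2)   # TA -> student path
    return (1.0 + eps) * path_teacher_ta + (1.0 + 1.0 / eps) * path_ta_student
```

Because the two path terms upper-bound the direct term, the student never needs to match the teacher's representation directly, which is what lets its architecture stay unchanged.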
Results: On nuScenes, our method improves the pure-vision baseline by +4.2% mIoU, an improvement up to 45% larger than that of other state-of-the-art KD methods, significantly advancing the practicality of efficient pure-vision BEV perception.
📝 Abstract
Bird's-Eye-View (BEV) map segmentation is one of the most important and challenging tasks in autonomous driving. Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. Knowledge Distillation (KD) has been explored to narrow this gap, but existing methods mainly enlarge the student model by mimicking the teacher's architecture, leading to higher inference cost. To address this issue, we introduce BridgeTA, a cost-effective distillation framework to bridge the representation gap between LC fusion and Camera-only models through a Teacher Assistant (TA) network while keeping the student's architecture and inference cost unchanged. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation. To ground the framework theoretically, we derive a distillation loss using Young's Inequality, which decomposes the direct teacher-student distillation path into teacher-TA and TA-student dual paths, stabilizing optimization and strengthening knowledge transfer. Extensive experiments on the challenging nuScenes dataset demonstrate the effectiveness of our method, achieving an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods.
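The Young's-inequality decomposition mentioned in the abstract can be sketched as follows; the notation ($f_T$, $f_A$, $f_S$ for teacher, TA, and student BEV features, and the weight $\varepsilon$) is our own assumption, and the paper's exact loss may differ. For any $\varepsilon > 0$, Young's inequality $2\langle x, y\rangle \le \varepsilon\|x\|^2 + \varepsilon^{-1}\|y\|^2$ gives

```latex
\|x + y\|^2 \le (1+\varepsilon)\|x\|^2 + (1+\varepsilon^{-1})\|y\|^2 .
```

Setting $x = f_T - f_A$ and $y = f_A - f_S$ yields

```latex
\|f_T - f_S\|^2 \le (1+\varepsilon)\,\|f_T - f_A\|^2 + (1+\varepsilon^{-1})\,\|f_A - f_S\|^2 ,
```

so minimizing the teacher-TA and TA-student path losses upper-bounds the direct teacher-student discrepancy, which is consistent with the claimed stabilization of optimization.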