🤖 AI Summary
To address inaccurate temporal perception in vehicle-infrastructure cooperative sensing, caused by occlusions, blind spots, and LiDAR calibration errors, this paper proposes LET-VIC, an end-to-end LiDAR-based temporal collaborative perception framework. Methodologically, it introduces a Vehicle-Infrastructure Cross-Attention (VIC) mechanism to jointly align spatiotemporal features across heterogeneous vehicle- and infrastructure-mounted LiDAR views, together with a learnable Calibration Error Compensation (CEC) module that corrects calibration errors automatically and supports end-to-end joint training. Evaluated on the V2X-Seq-SPD benchmark, LET-VIC achieves 15.0% higher mAP and 17.3% higher AMOTA than the baseline LET-V, and outperforms representative tracking-by-detection methods, including V2VNet, by at least 13.7% (mAP) and 13.1% (AMOTA). These gains demonstrate substantially more robust detection and tracking in complex urban driving scenarios with severe occlusion and sensor misalignment.
📝 Abstract
Temporal perception, defined as the capability to detect and track objects across temporal sequences, is a fundamental component of autonomous driving systems. While single-vehicle perception systems suffer from incomplete perception due to object occlusion and inherent blind spots, cooperative perception systems face their own challenges in sensor calibration precision and positioning accuracy. To address these issues, we introduce LET-VIC, a LiDAR-based End-to-End Tracking framework for Vehicle-Infrastructure Cooperation (VIC). First, we employ Temporal Self-Attention and VIC Cross-Attention modules to effectively integrate temporal and spatial information from both vehicle and infrastructure perspectives. Then, we develop a novel Calibration Error Compensation (CEC) module to mitigate sensor misalignment and enable accurate feature alignment. Experiments on the V2X-Seq-SPD dataset demonstrate that LET-VIC significantly outperforms baseline models. Compared to LET-V, LET-VIC achieves a +15.0% improvement in mAP and a +17.3% improvement in AMOTA. Furthermore, LET-VIC surpasses representative tracking-by-detection models, including V2VNet, FFNet, and PointPillars, by at least +13.7% in mAP and +13.1% in AMOTA when communication delays are not considered, demonstrating robust detection and tracking performance. The experiments also show that integrating multi-view perspectives, temporal sequences, or CEC into end-to-end training each significantly improves both detection and tracking performance. All code will be open-sourced.
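To make the fusion step concrete, here is a minimal NumPy sketch of the two ideas named in the abstract: cross-attention that lets vehicle features query infrastructure features, preceded by a coarse calibration-offset correction of the infrastructure map. This is not the authors' implementation; the grid sizes, the single attention head, and the fixed integer-cell shift are illustrative assumptions (the real CEC module learns its offset and would re-sample at sub-cell precision).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vic_cross_attention(veh, infra, Wq, Wk, Wv):
    """Single-head cross-attention: vehicle BEV tokens query infrastructure
    BEV tokens, so each vehicle cell can aggregate roadside evidence for
    regions occluded from the vehicle's own viewpoint."""
    Q, K, V = veh @ Wq, infra @ Wk, infra @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V  # shape: (num_vehicle_tokens, d)

def compensate_calibration(infra_bev, offset_cells):
    """Crude stand-in for the CEC module: translate the infrastructure BEV
    map by an integer cell offset before fusion (here fixed, not learned)."""
    dy, dx = offset_cells
    return np.roll(infra_bev, shift=(dy, dx), axis=(0, 1))

# Toy BEV maps: an 8x8 grid with 16-dim features, flattened into tokens.
H, W, d = 8, 8, 16
veh_bev = rng.standard_normal((H, W, d))
infra_bev = rng.standard_normal((H, W, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

infra_aligned = compensate_calibration(infra_bev, offset_cells=(1, -2))
fused = vic_cross_attention(veh_bev.reshape(-1, d),
                            infra_aligned.reshape(-1, d), Wq, Wk, Wv)
print(fused.shape)  # (64, 16)
```

In the paper's end-to-end setting, both the attention projections and the calibration offset would be trained jointly with the detection and tracking losses, which is what lets CEC absorb sensor misalignment without a separate calibration stage.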