🤖 AI Summary
To address the limited robustness of single-sensor perception for autonomous vehicles under adverse weather and in complex urban environments, this paper systematically surveys multimodal sensor fusion, formalizing data-level, feature-level, and decision-level fusion within a coherent formalism and reviewing the deep learning methods developed for each paradigm. It examines emerging directions, notably the integration of vision-language models (VLMs) and large language models (LLMs) into the sensor fusion pipeline and the role of fusion in end-to-end autonomous driving, where these techniques promise greater adaptability and better uncertainty modeling. The paper also covers key multimodal benchmarks, including nuScenes, BDD100K, and Oxford Radar RobotCar, and discusses how fusion improves object detection and semantic segmentation under challenging conditions such as rain, fog, and nighttime.
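The three fusion paradigms are easiest to see side by side in code. The sketch below is illustrative only: the module names, dimensions, and the concatenation-based fusion are our assumptions, not the paper's implementation. Data-level (early) fusion stacks raw aligned inputs, feature-level fusion combines encoder features, and decision-level (late) fusion averages per-modality predictions.

```python
# Illustrative sketch of the three fusion paradigms (hypothetical modules,
# not the surveyed paper's code). Linear layers stand in for real backbones.
import torch
import torch.nn as nn

def data_level_fusion(rgb, depth):
    """Data-level (early) fusion: stack raw, spatially aligned inputs on the channel axis."""
    return torch.cat([rgb, depth], dim=1)  # e.g. (B, 3, H, W) + (B, 1, H, W) -> (B, 4, H, W)

class FeatureLevelFusion(nn.Module):
    """Feature-level fusion: fuse per-modality encoder features before a shared head."""
    def __init__(self, cam_dim=256, lidar_dim=256, fused_dim=256, num_classes=10):
        super().__init__()
        self.cam_encoder = nn.Linear(cam_dim, fused_dim)      # stand-in for an image backbone
        self.lidar_encoder = nn.Linear(lidar_dim, fused_dim)  # stand-in for a point-cloud backbone
        self.head = nn.Linear(2 * fused_dim, num_classes)

    def forward(self, cam_feats, lidar_feats):
        f_cam = self.cam_encoder(cam_feats)
        f_lidar = self.lidar_encoder(lidar_feats)
        fused = torch.cat([f_cam, f_lidar], dim=-1)  # concatenate latent features
        return self.head(fused)

def decision_level_fusion(logits_cam, logits_lidar, w_cam=0.5, w_lidar=0.5):
    """Decision-level (late) fusion: weighted average of per-modality predictions."""
    return w_cam * logits_cam.softmax(-1) + w_lidar * logits_lidar.softmax(-1)
```

Real systems differ mainly in where the fusion happens along this spectrum; the trade-off is between preserving raw cross-modal correlations (early) and tolerating modality-specific failures (late).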
📝 Abstract
Multi-sensor fusion plays a critical role in enhancing perception for autonomous driving, overcoming individual sensor limitations and enabling comprehensive environmental understanding. This paper first formalizes multi-sensor fusion strategies into data-level, feature-level, and decision-level categories and then provides a systematic review of deep learning-based methods corresponding to each strategy. We present key multimodal datasets and discuss their applicability to real-world challenges, particularly adverse weather conditions and complex urban environments. Additionally, we explore emerging trends, including the integration of Vision-Language Models (VLMs) and Large Language Models (LLMs), as well as the role of sensor fusion in end-to-end autonomous driving, highlighting its potential to enhance system adaptability and robustness. Our work offers valuable insights into current methods and future directions for multi-sensor fusion in autonomous driving.
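As one hedged illustration of the VLM/LLM trend the abstract highlights, the sketch below conditions fused sensor features on a text embedding from a CLIP text encoder via cross-attention. The checkpoint name, feature dimensions, and the residual cross-attention design are assumptions for illustration, not the paper's method.

```python
# Hedged sketch: conditioning fused sensor features on a VLM text embedding
# via cross-attention. Checkpoint, dimensions, and design are illustrative
# assumptions, not the surveyed paper's pipeline.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

class LanguageConditionedFusion(nn.Module):
    def __init__(self, sensor_dim=256, text_dim=512, num_heads=4):
        super().__init__()
        # kdim/vdim let the text embedding keep its native width (512 for CLIP base)
        self.attn = nn.MultiheadAttention(
            embed_dim=sensor_dim, num_heads=num_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True)

    def forward(self, sensor_feats, text_emb):
        # sensor_feats: (B, N, sensor_dim) fused camera/LiDAR tokens
        # text_emb:     (B, 1, text_dim) pooled embedding of a scene description
        attended, _ = self.attn(sensor_feats, text_emb, text_emb)
        return sensor_feats + attended  # residual language conditioning

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
tokens = tokenizer(["heavy rain on an urban road at night"], return_tensors="pt")
text_emb = text_model(**tokens).pooler_output.unsqueeze(1)  # (1, 1, 512)

fusion = LanguageConditionedFusion()
sensor_feats = torch.randn(1, 100, 256)  # placeholder fused sensor/BEV tokens
out = fusion(sensor_feats, text_emb)     # (1, 100, 256)
```

The appeal of this pattern is that a language description of conditions ("heavy rain", "dense fog") can modulate how the perception stack weighs its modalities, which is one concrete route to the adaptability and robustness gains the abstract points to.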