V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

📅 2024-08-17
🏛️ arXiv.org
📈 Citations: 14
Influential: 1
🤖 AI Summary
To address the challenges of fusing heterogeneous perception inputs (onboard vision, roadside multimodal data, and natural language instructions) and the limited robustness of trajectory planning in vehicle-infrastructure cooperative (V2X) systems, this paper proposes the first end-to-end V2X autonomous driving framework based on large vision-language models (VLMs). Methodologically, it unifies environmental understanding and trajectory generation in a single architecture, incorporates contrastive learning to strengthen cross-modal representations, and designs a multi-source heterogeneous data fusion module amenable to end-to-end training. The core contribution is the first systematic integration of VLMs into V2X cooperative driving, enabling semantic-level coupling of perception and decision-making. Evaluated on the DAIR-V2X benchmark, the proposed approach significantly outperforms state-of-the-art methods, and a comprehensive corner-case analysis further demonstrates strong generalization and robustness under real-world road deployment conditions.
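The summary does not reproduce the paper's code, but the cross-modal contrastive objective it describes is typically instantiated as a symmetric InfoNCE loss over paired image and text embeddings (the CLIP formulation). The sketch below is illustrative only: the function name, temperature value, and embedding shapes are assumptions, not V2X-VLM's actual implementation.

```python
# Minimal sketch of a CLIP-style image-text contrastive (InfoNCE) loss,
# the kind of objective the summary attributes to V2X-VLM. All names
# here are illustrative; the paper's exact formulation may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor,
                     txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    img_emb, txt_emb: (B, D) tensors; pairs at the same batch index match.
    """
    # Normalize so the dot product is cosine similarity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the positive pairs.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Pulling mismatched image-text pairs apart while pulling matched pairs together is what "refining feature discrimination" amounts to in practice: the loss sharpens the embedding space so the planner sees more separable representations of the scene.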

📝 Abstract
Advancements in autonomous driving have increasingly focused on end-to-end (E2E) systems that manage the full spectrum of driving tasks, from environmental perception to vehicle navigation and control. This paper introduces V2X-VLM, an innovative E2E vehicle-infrastructure cooperative autonomous driving (VICAD) framework built on Vehicle-to-Everything (V2X) systems and large vision-language models (VLMs). V2X-VLM is designed to enhance situational awareness, decision-making, and, ultimately, trajectory planning by integrating multimodal data from vehicle-mounted cameras, infrastructure sensors, and textual information. Contrastive learning is further employed to complement the VLM by refining feature discrimination, helping the model learn robust representations of the driving environment. Evaluations on the DAIR-V2X dataset show that V2X-VLM outperforms state-of-the-art cooperative autonomous driving methods, while additional tests on corner cases validate its robustness in real-world driving conditions.
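As a rough illustration of the multi-source fusion the abstract describes (vehicle-mounted cameras, infrastructure sensors, and text), the following hypothetical module encodes each input stream into token sequences and concatenates them before trajectory decoding. Every class and module name here (V2XFusionPlanner, the encoders, the decoder) is a placeholder; V2X-VLM's actual architecture may fuse these inputs differently.

```python
# Hypothetical sketch of multi-source fusion for an E2E VLM planner:
# encode vehicle-side and infrastructure-side images plus a text prompt,
# concatenate the token sequences, and decode future waypoints.
import torch
import torch.nn as nn

class V2XFusionPlanner(nn.Module):
    def __init__(self, vision_encoder: nn.Module,
                 text_encoder: nn.Module,
                 decoder: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # shared across camera views
        self.text_encoder = text_encoder
        self.decoder = decoder                # e.g. autoregressive head

    def forward(self, vehicle_img, infra_img, text_tokens):
        # Each encoder returns a (B, N, D) token sequence.
        veh = self.vision_encoder(vehicle_img)
        inf = self.vision_encoder(infra_img)
        txt = self.text_encoder(text_tokens)
        # Early fusion: one joint token sequence for the decoder.
        fused = torch.cat([veh, inf, txt], dim=1)
        return self.decoder(fused)  # e.g. (B, T, 2) future waypoints
```

Early fusion by token concatenation is only one design point; cross-attention between the vehicle and infrastructure streams is a common alternative when view alignment matters.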
Problem

Research questions and friction points this paper is trying to address.

Overcoming perception limits in autonomous driving via V2X cooperation
Fusing visual and semantic data for robust trajectory planning
Enhancing driving safety with vision-language model integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end V2X framework with vision-language models
Contrastive learning for visual-textual alignment
Knowledge distillation for stable training (sketched below)
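The list above names knowledge distillation for stable training, but this summary does not detail the paper's teacher-student setup. Below is a minimal, generic sketch of a Hinton-style soft-target distillation loss, assuming logits from a frozen teacher and a trainable student; the function name and temperature are illustrative, not the paper's configuration.

```python
# Hedged sketch of a soft-target distillation loss: a frozen teacher's
# temperature-softened outputs regularize the student via a KL term.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence from softened student to softened teacher logits."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T**2 rescales gradients to match the hard-label loss magnitude.
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * T ** 2
```

In practice this term is typically mixed with the task loss (e.g., trajectory regression) via a weighting coefficient, which is what gives the stabilizing effect the bullet refers to.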