Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited interpretability of dynamic driving scene understanding and decision-making in autonomous driving, this paper proposes a CLIP-enhanced, frame-level semantic understanding method tailored for edge-device deployment. The authors adapt the ViT-L/14 and ViT-B/32 CLIP architectures to in-vehicle dynamic scene tasks, combining contrastive learning, vision-language joint embedding, supervised fine-tuning (on the Honda Scenes dataset), and lightweight deployment optimization. The proposed method surpasses the zero-shot performance of GPT-4o under complex real-world traffic conditions, with significantly improved few-shot generalization and robustness. On Honda Scenes, it attains a top F1 score of 91.1% while meeting core ADAS requirements: real-time inference (<50 ms), high accuracy, and high reliability. Moreover, it enables safety-critical decision support and human-factor-oriented interpretability analysis.

📝 Abstract
Scene understanding is essential for enhancing driver safety, generating human-centric explanations for Automated Vehicle (AV) decisions, and leveraging Artificial Intelligence (AI) for retrospective driving video analysis. This study developed a dynamic scene retrieval system using Contrastive Language-Image Pretraining (CLIP) models, which can be optimized for real-time deployment on edge devices. The proposed system outperforms state-of-the-art in-context learning methods, including the zero-shot capabilities of GPT-4o, particularly in complex scenarios. By conducting frame-level analysis on the Honda Scenes Dataset, which contains a collection of about 80 hours of annotated driving videos capturing diverse real-world road and weather conditions, our study highlights the robustness of CLIP models in learning visual concepts from natural language supervision. Results also showed that fine-tuning the CLIP models, such as ViT-L/14 and ViT-B/32, significantly improved scene classification, achieving a top F1 score of 91.1%. These results demonstrate the ability of the system to deliver rapid and precise scene recognition, which can be used to meet the critical requirements of Advanced Driver Assistance Systems (ADAS). This study shows the potential of CLIP models to provide scalable and efficient frameworks for dynamic scene understanding and classification. Furthermore, this work lays the groundwork for advanced autonomous vehicle technologies by fostering a deeper understanding of driver behavior, road conditions, and safety-critical scenarios, marking a significant step toward smarter, safer, and more context-aware autonomous driving systems.
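The zero-shot scene classification the abstract describes rests on CLIP's joint embedding space: a frame's image embedding is compared against text-prompt embeddings for each candidate scene label, and the scaled cosine similarities are softmaxed into class probabilities. The sketch below illustrates that mechanism with random stand-in embeddings; the labels, dimensions, and temperature are illustrative assumptions, and a real system would obtain the embeddings from CLIP's image and text encoders.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=0.01):
    """CLIP-style zero-shot classification: softmax over scaled cosine
    similarities between an image embedding and text-prompt embeddings."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per label
    logits = sims / temperature           # CLIP uses a learned temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return dict(zip(labels, probs))

# Hypothetical scene labels; real prompts would be phrases such as
# "a photo of driving on a rainy city street".
labels = ["clear highway", "rainy city street", "snowy rural road"]
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)       # stand-in for CLIP image encoder output
text_embs = rng.normal(size=(3, 512))  # stand-in for CLIP text encoder outputs
scores = zero_shot_classify(image_emb, text_embs, labels)
print(max(scores, key=scores.get))
```

Fine-tuning, as reported in the paper, would adjust the encoder weights so that driving-scene frames and their matching label prompts move closer together in this shared space, which is what lifts the F1 score above the zero-shot baseline.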
Problem

Research questions and friction points this paper is trying to address.

Dynamic Driving Environment
Autonomous Vehicles Safety
AI Video Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP model
dynamic driving scene recognition
autonomous driving safety
Mohammed Elhenawy
Queensland University of Technology/CARRS-Q/Centre for Data Science
Huthaifa I. Ashqar
Arab American University
Machine Learning, AI, Intelligent Transportation Systems, Connected and Automated Vehicles
A. Rakotonirainy
CARRS-Q, Queensland University of Technology, Kelvin Grove QLD 4059, Australia
Taqwa I. Alhadidi
Civil Engineering Department, Al-Ahliyya Amman University, Amman 19328
Ahmed Jaber
Association of Palestinian Local Authorities
Transportation Engineering, Road Safety, Micromobility, Travel Behaviour, SDGs
M. Tami
Natural, Engineering and Technology Sciences Department, Arab American University, Jenin P.O. Box 240, Palestine