🤖 AI Summary
Problem: The use of Vision-Language Models (VLMs) in Driver Monitoring Systems (DMS), particularly in zero-shot settings, has not been systematically investigated. Method: This work applies large multimodal models, including LLaVA and Qwen-VL, to DMS in a zero-shot manner via driving-specific prompt engineering, with no fine-tuning or task-specific architectural modifications. Contribution/Results: Evaluated on the Driver Monitoring Dataset, the zero-shot approach is reported to outperform conventional supervised models on critical tasks such as fatigue and distraction detection. Notably, VLMs demonstrate strong semantic comprehension and cross-scenario generalization, capabilities that are inherently limited in traditional DMS pipelines. By avoiding reliance on large-scale annotated data and handcrafted feature engineering, the study points toward lightweight, interpretable, and open-domain in-cabin perception. It bridges a key gap between foundation models and real-world automotive vision applications, offering a scalable framework for next-generation intelligent cockpit systems.
📝 Abstract
In recent years, we have witnessed significant progress in emerging deep learning models, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs). These models have demonstrated promising results, indicating a new era of Artificial Intelligence (AI) that surpasses previous methodologies. Their extensive knowledge and zero-shot capabilities suggest a paradigm shift in developing deep learning solutions, moving from data capturing and algorithm training to just writing appropriate prompts. While the application of these technologies has been explored across various industries, including automotive, there is a notable gap in the scientific literature regarding their use in Driver Monitoring Systems (DMS). This paper presents our initial approach to implementing VLMs in this domain, utilising the Driver Monitoring Dataset to evaluate their performance and discussing their advantages and challenges when implemented in real-world scenarios.
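To make the idea of replacing data capture and training with prompt writing concrete, the following is a minimal sketch of a zero-shot driver-state query. It is not the paper's implementation: the label set, the prompt wording, and the `vlm` callable (any function that takes image bytes and a prompt and returns text) are all assumptions made for illustration, and the model call is stubbed out so that the example runs without a real VLM.

```python
from typing import Callable, Optional, Sequence

# Hypothetical label set for illustration only; it is not claimed to match
# the Driver Monitoring Dataset's own taxonomy.
LABELS = ("safe driving", "texting", "phone call", "drinking", "drowsy")


def build_prompt(labels: Sequence[str]) -> str:
    """Build a closed-set, multiple-choice prompt for a driver image."""
    options = "\n".join(f"{i + 1}. {label}" for i, label in enumerate(labels))
    return (
        "You are monitoring a driver through an in-cabin camera.\n"
        "Which option best describes the driver in the image?\n"
        f"{options}\n"
        "Answer with the option number only."
    )


def parse_answer(reply: str, labels: Sequence[str]) -> Optional[str]:
    """Map a free-text model reply to a label, or None if it is ambiguous."""
    text = reply.strip().lower()

    # Prefer a leading option number such as "2" or "2.".
    digits = ""
    for ch in text:
        if not ch.isdigit():
            break
        digits += ch
    if digits:
        idx = int(digits) - 1
        return labels[idx] if 0 <= idx < len(labels) else None

    # Fallback: accept the reply only if it mentions exactly one label.
    hits = [label for label in labels if label in text]
    return hits[0] if len(hits) == 1 else None


def classify(
    image: bytes,
    vlm: Callable[[bytes, str], str],
    labels: Sequence[str] = LABELS,
) -> Optional[str]:
    """Run one zero-shot query: no training, only a prompt and a parser."""
    return parse_answer(vlm(image, build_prompt(labels)), labels)


def fake_vlm(image: bytes, prompt: str) -> str:
    """Stand-in for a real VLM call (assumption: replies with an option number)."""
    return "2."


print(classify(b"", fake_vlm))
```

In practice `fake_vlm` would be replaced by a wrapper around whichever VLM is being evaluated. Returning `None` for unparseable replies keeps malformed outputs visible during evaluation instead of silently counting them as a class.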