Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling

📅 2026-01-13

🤖 AI Summary

This work addresses the susceptibility of existing vision-language models (VLMs) to identity-related confounders such as clothing, age, and gender in distracted driver detection, which leads models to rely on who the driver is rather than what the driver is doing. To mitigate this, the authors propose a double decoupling framework. First, a driver appearance embedding is extracted and its influence is removed from the image embedding before zero-shot classification, so that the representation emphasizes behavioral cues. Second, the text embeddings are orthogonalized by metric projection onto the Stiefel manifold, which improves separability among categories while keeping the embeddings close to their original semantics. Experiments show consistent gains over prior baselines, indicating the approach's promise for practical road-safety applications.
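The appearance-removal step can be illustrated with a minimal sketch. The summary does not state the exact removal rule, so this example assumes one plausible choice: projecting the image embedding onto the orthogonal complement of the appearance embedding. The function names, the toy 4-D vectors, and the class prompts are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def remove_appearance(image_emb, appearance_emb):
    """Assumed removal rule: project out the appearance direction, then renormalize."""
    a = l2_normalize(appearance_emb)
    residual = image_emb - np.dot(image_emb, a) * a
    return l2_normalize(residual)


def zero_shot_predict(image_emb, text_embs):
    """Return the index of the class prompt with the highest cosine similarity."""
    scores = l2_normalize(text_embs) @ l2_normalize(image_emb)
    return int(np.argmax(scores))


# Toy embeddings: the last axis plays the role of "appearance".
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],  # hypothetical prompt for class 0, e.g. "safe driving"
    [0.0, 1.0, 0.0, 0.8],  # hypothetical prompt for class 1, entangled with appearance
])
image_emb = np.array([0.6, 0.1, 0.0, 0.9])       # class-0 behavior, strong appearance component
appearance_emb = np.array([0.0, 0.0, 0.0, 1.0])  # stand-in for an extracted driver appearance embedding

raw_pred = zero_shot_predict(image_emb, text_embs)
clean_pred = zero_shot_predict(remove_appearance(image_emb, appearance_emb), text_embs)
print(raw_pred, clean_pred)
```

In this toy setup the raw embedding is classified as class 1 because of the shared appearance component, while the decoupled embedding is classified as class 0. A real pipeline would obtain both embeddings from a VLM image encoder, which this sketch does not model.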

📝 Abstract

Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto the Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
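The text-side orthogonalization can be sketched as follows. For a full-rank matrix, the closest matrix with orthonormal rows in Frobenius norm is the polar factor obtained from the thin SVD, which is the standard metric projection onto the Stiefel manifold. The layout (one class prompt per row, fewer classes than embedding dimensions) and the function name `project_to_stiefel` are assumptions made for illustration; the abstract does not give the paper's exact formulation.

```python
import numpy as np


def project_to_stiefel(text_embs):
    """Nearest matrix with orthonormal rows (Frobenius norm) via the thin SVD.

    Assumes text_embs has shape (num_classes, dim) with num_classes <= dim
    and full row rank.
    """
    u, _, vt = np.linalg.svd(text_embs, full_matrices=False)
    return u @ vt  # polar factor: singular values are replaced by ones


# Random stand-ins for class prompt embeddings (3 classes, 8 dimensions).
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))
ortho_embs = project_to_stiefel(text_embs)

# Rows are now mutually orthogonal and unit length.
gram = ortho_embs @ ortho_embs.T
print(np.allclose(gram, np.eye(3)))

# Each projected row still points in roughly the same direction as its original.
alignment = np.sum(ortho_embs * text_embs, axis=1)
print(np.all(alignment > 0))
```

Because the projection is the nearest point on the manifold, the orthogonalized prompts stay as close as possible to the originals, which matches the stated goal of improving separability while staying close to the original semantics.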

Problem

Research questions and friction points this paper is trying to address.

distracted driver detection

vision-language models

zero-shot classification

appearance variation

behavior recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot learning

vision-language models

subject decoupling