Cross-View Cross-Modal Unsupervised Domain Adaptation for Driver Monitoring System

πŸ“… 2025-11-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Driver distraction recognition faces two domain-shift challenges in real-world deployment: cross-view shifts (from variations in camera placement) and cross-modal shifts (from changes in sensor or environment). Existing methods typically address these issues separately, limiting generalizability and scalability. This paper proposes the first two-stage framework for joint unsupervised domain adaptation across both view and modality. Stage one employs contrastive learning to extract view-invariant yet action-discriminative spatiotemporal features. Stage two introduces an information bottleneck loss that aligns the domains without any target-domain labels. Evaluated on the Drive&Act dataset with video transformers (e.g., Video Swin, MViT), the method achieves 89.2% Top-1 accuracy on RGB input, nearly 50% higher than supervised contrastive baselines and up to 5% higher than state-of-the-art single-shift adaptation methods. The framework substantially improves robustness and deployability in realistic driving scenarios.
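The summary does not spell out the Stage 1 objective beyond "contrastive learning on multi-view data". A minimal sketch of one plausible instantiation, assuming an InfoNCE-style loss where time-synchronized clips of the same action seen from two camera views form positive pairs; all names, shapes, and the temperature value are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(feat_view_a, feat_view_b, temperature=0.07):
    """InfoNCE-style loss: features of the same clip captured from two
    camera views are pulled together; all other clips in the batch are
    pushed apart.

    feat_view_a, feat_view_b: (batch, dim) spatiotemporal features from a
    shared video backbone (e.g., Video Swin or MViT), one tensor per view.
    """
    za = F.normalize(feat_view_a, dim=1)
    zb = F.normalize(feat_view_b, dim=1)
    logits = za @ zb.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(za.size(0), device=za.device)  # matched rows are positives
    # Symmetric loss: each view must retrieve its counterpart in the other view.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because the positive pair shows the same action from different viewpoints, the only features that survive the pull-together pressure are those invariant to camera placement yet still discriminative of the action.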

πŸ“ Abstract
Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as changes in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using an information bottleneck loss, without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) on the multimodal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5% using the same video transformer backbone.
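A hedged sketch of how the Phase 2 objective could look, assuming the common variational information-bottleneck formulation (a Gaussian latent compressed by a KL term toward a standard normal prior); the paper's exact loss may differ, and VIBHead, kl_to_standard_normal, phase2_loss, and beta are hypothetical names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    """Variational information-bottleneck head: maps backbone features to a
    Gaussian latent and classifies from a sample of that latent."""
    def __init__(self, in_dim, latent_dim, num_classes):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)
        self.cls = nn.Linear(latent_dim, num_classes)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.cls(z), mu, logvar

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|x) || N(0, I)): the compression term of the bottleneck."""
    return -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()

def phase2_loss(head, src_feat, src_labels, tgt_feat, beta=1e-3):
    """Source clips keep the classifier discriminative; the KL term is
    label-free, so it also compresses unlabeled target-modality clips,
    squeezing both domains toward the same latent prior."""
    logits, mu_s, lv_s = head(src_feat)
    _, mu_t, lv_t = head(tgt_feat)
    ce = F.cross_entropy(logits, src_labels)
    kl = kl_to_standard_normal(mu_s, lv_s) + kl_to_standard_normal(mu_t, lv_t)
    return ce + beta * kl
```

The key property for unsupervised adaptation is that the compression term needs no labels, so the unlabeled new-modality clips can participate in it directly.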
Problem

Research questions and friction points this paper is trying to address.

Addresses driver distraction detection across varying camera viewpoints
Tackles domain shifts from sensor modality and environment changes
Enables robust model deployment across diverse vehicle configurations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase cross-view, cross-modal domain adaptation (both phases combined in the training-loop sketch below)
Contrastive learning for view-invariant feature extraction
Information bottleneck loss for unlabeled modality adaptation
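Putting the items above together, a hypothetical end-to-end training loop, reusing cross_view_contrastive_loss, VIBHead, and phase2_loss from the earlier sketches; the loader formats, optimizer split, and epoch counts are assumptions, not the paper's recipe:

```python
def train_two_phase(backbone, vib_head, source_loader, target_loader,
                    opt1, opt2, epochs=(10, 10)):
    """source_loader yields synchronized (view_a, view_b, label) triples from
    the labeled source modality; target_loader yields unlabeled clips from the
    new modality. backbone is any video transformer (e.g., Video Swin, MViT)
    returning (batch, dim) features."""
    # Phase 1: view-invariant, action-discriminative features via contrastive learning.
    for _ in range(epochs[0]):
        for view_a, view_b, _ in source_loader:
            loss = cross_view_contrastive_loss(backbone(view_a), backbone(view_b))
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Phase 2: adapt to the target modality with the bottleneck objective;
    # no target labels are used at any point.
    for _ in range(epochs[1]):
        for (view_a, _, labels), tgt_clips in zip(source_loader, target_loader):
            loss = phase2_loss(vib_head, backbone(view_a), labels, backbone(tgt_clips))
            opt2.zero_grad(); loss.backward(); opt2.step()
```

Splitting training into two phases lets the contrastive stage settle the view geometry first, so the bottleneck stage only has to bridge the modality gap.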
Aditi Bhalla
School of Social Sciences and Technology, Technical University Munich, Germany
Christian Hellert
Aumovio SE, Germany
Enkelejda Kasneci
Professor at the Technical University of Munich
Eye Tracking · AI in Education · Human-Centered AI · Computational Interaction · HCI