Spatial-Temporal Perception with Causal Inference for Naturalistic Driving Action Recognition

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of fine-grained driving behavior recognition and weak temporal modeling in real-world cockpit scenarios, this paper proposes a unimodal (RGB) spatiotemporal-aware architecture. Methodologically, we design a causal decoder to explicitly model causal dependencies among frame-wise features; introduce a permutation-invariant maximum-likelihood encoding mechanism to jointly optimize spatial and temporal representations; and incorporate multi-scale spatiotemporal distance feature extraction and fusion. Our core contribution is the first integration of causal inference into spatiotemporal joint modeling for driver behavior recognition, without relying on multimodal inputs. Evaluated on two mainstream distracted driving detection benchmarks, our approach achieves state-of-the-art performance, significantly improving both fine-action discrimination accuracy and temporal localization precision of driving behaviors.

📝 Abstract
Naturalistic driving action recognition is essential for vehicle cabin monitoring systems. However, the complexity of real-world backgrounds presents significant challenges for this task, and previous approaches have struggled with practical implementation due to their limited ability to observe subtle behavioral differences and effectively learn inter-frame features from video. In this paper, we propose a novel Spatial-Temporal Perception (STP) architecture that emphasizes both temporal information and spatial relationships between key objects, incorporating a causal decoder to perform behavior recognition and temporal action localization. Without requiring multimodal input, STP directly extracts temporal and spatial distance features from RGB video clips. Subsequently, these dual features are jointly encoded by maximizing the expected likelihood across all possible permutations of the factorization order. By integrating temporal and spatial features at different scales, STP can perceive subtle behavioral changes in challenging scenarios. Additionally, we introduce a causal-aware module to explore relationships between video frame features, significantly enhancing detection efficiency and performance. We validate the effectiveness of our approach using two publicly available driver distraction detection benchmarks. The results demonstrate that our framework achieves state-of-the-art performance.
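The abstract's "maximizing the expected likelihood across all possible permutations of the factorization order" is the XLNet-style permutation language-modeling objective. As a hedged sketch (the notation below is ours, not taken from the paper), for a length-$T$ sequence of jointly encoded frame features $\mathbf{x} = (x_1, \dots, x_T)$:

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[
  \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right)
\right]
```

where $\mathcal{Z}_T$ is the set of all permutations of $\{1, \dots, T\}$, $z_t$ is the $t$-th element of a permutation $\mathbf{z}$, and $\mathbf{z}_{<t}$ its first $t-1$ elements. Averaging over factorization orders is what makes the encoding permutation-invariant while still training an autoregressive (causal) decoder.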
Problem

Research questions and friction points this paper is trying to address.

Recognize naturalistic driving actions for vehicle cabin monitoring.
Address challenges in real-world background complexity and subtle behavior detection.
Enhance temporal and spatial feature learning from RGB video clips.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-Temporal Perception architecture for action recognition.
Causal decoder for behavior recognition and temporal action localization.
Causal-aware module enhances detection efficiency and performance.
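The causal decoder described above constrains each frame's representation to depend only on earlier frames. A minimal NumPy sketch of causally masked self-attention over per-frame features (function name, weight shapes, and the single-head setup are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def causal_self_attention(frames, Wq, Wk, Wv):
    """Single-head self-attention over a (T, d) sequence of frame features.

    A strictly upper-triangular mask removes attention to future frames,
    so the output at time t depends only on frames 0..t (causal decoding).
    """
    Q, K, V = frames @ Wq, frames @ Wk, frames @ Wv
    T, d = Q.shape
    scores = (Q @ K.T) / np.sqrt(d)
    # Block attention to future frames: position t may not see t+1, t+2, ...
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf
    # Row-wise softmax (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V
```

Because frame 0 can attend only to itself, its output is exactly its value projection, which gives a quick sanity check of the mask.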