🤖 AI Summary
Existing driver behavior datasets commonly lack precise object localization annotations and explicit action-object associations, limiting fine-grained behavior recognition. To address this gap, this work introduces DAOS, a multimodal, multi-view dataset, along with the Action-Object-Relation Network (AOR-Net). AOR-Net incorporates a novel action-object co-annotation paradigm, a chained action prompting mechanism, and a Mixture of Thoughts module to dynamically model human-object semantic relationships and focus on task-relevant contextual information. Experimental results demonstrate that AOR-Net significantly outperforms current methods in both object-rich and object-sparse scenarios, confirming its robustness and generalization capability in complex driving environments.
📝 Abstract
In driver activity monitoring, movements are mostly limited to the upper body, which makes many actions look similar. To tell these actions apart, humans often rely on the objects the driver is using, such as distinguishing holding a phone from gripping the steering wheel. However, most existing driver-monitoring datasets lack accurate object-location annotations or do not link objects to their associated actions, leaving a critical gap for reliable action recognition. To address this, we introduce the Driver Action with Object Synergy (DAOS) dataset, comprising 9,787 video clips annotated with 36 fine-grained driver actions and 15 object classes, totaling more than 2.5 million annotated object instances. DAOS offers multi-modal, multi-view data (RGB, IR, and depth) captured from front, face, left, and right perspectives. Although DAOS covers a wide range of cabin objects, only a few are directly relevant to predicting each action, so focusing on task-specific human-object relations is essential. To tackle this challenge, we propose the Action-Object-Relation Network (AOR-Net), which comprehends complex driver actions through multi-level reasoning and a chain-of-action prompting mechanism that models the logical relationships among actions, objects, and their relations. Additionally, a Mixture of Thoughts module dynamically selects essential knowledge at each stage, enhancing robustness in both object-rich and object-scarce conditions. Extensive experiments demonstrate that our model outperforms other state-of-the-art methods across multiple datasets.
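The abstract does not spell out how the Mixture of Thoughts module selects "essential knowledge at each stage." One common way to realize such dynamic selection is a learned softmax gate over candidate feature sources; the following is a minimal, hypothetical NumPy sketch of that general pattern, not the authors' implementation (all names, shapes, and the random weights are assumptions):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mixture_of_thoughts(stage_query, thoughts, gate_proj):
    """Gate over candidate 'thought' features and return their weighted mix.

    stage_query: (d,)  feature summarizing the current reasoning stage
    thoughts:    (k, d) one row per knowledge source (e.g. action / object /
                 relation cues) -- illustrative stand-ins, not the paper's design
    gate_proj:   (d, k) gating projection (random here, learned in practice)
    """
    gate = softmax(stage_query @ gate_proj)   # (k,) relevance of each thought
    fused = gate @ thoughts                   # (d,) relevance-weighted mixture
    return fused, gate

rng = np.random.default_rng(0)
d, k = 8, 3                                   # feature dim, number of thoughts
query = rng.standard_normal(d)
thoughts = rng.standard_normal((k, d))
W = rng.standard_normal((d, k))

fused, gate = mixture_of_thoughts(query, thoughts, W)
assert np.isclose(gate.sum(), 1.0)            # gate is a valid distribution
```

Under this sketch, stages that see few relevant objects would simply shift gate mass toward non-object cues, which is one plausible reading of the claimed robustness in object-scarce conditions.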