LS-HAR: Language Supervised Human Action Recognition with Salient Fusion, Construction Sites as a Use-Case

📅 2024-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of autonomous robots in recognizing human actions within construction environments, this paper proposes a language-supervised multimodal action recognition framework (LS-HAR). Specifically, it uses learnable language prompts, conditioned on the skeleton modality, to guide feature extraction in the skeleton encoder, and an attention- and Transformer-based salient fusion module that combines skeleton and visual features while adaptively focusing on discriminative frames and body joints. Furthermore, it introduces VolvoConstAct, a multimodal (visual, skeleton, and depth) action dataset tailored to construction scenarios. Experiments on VolvoConstAct, NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA show promising recognition accuracy and cross-scenario generalization.

📝 Abstract
Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) using language supervision named LS-HAR based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset tailored for real-world robotic applications in construction sites, featuring visual, skeleton, and depth data modalities, named VolvoConstAct. This dataset serves to facilitate the training and evaluation of machine learning models to instruct autonomous construction machines for performing necessary tasks in real-world construction sites. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets: NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA. Results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The code, dataset, and demonstration of real-machine experiments are available at: https://mmahdavian.github.io/ls_har/
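The abstract describes a salient fusion module that uses attention to prioritize informative video frames and body joints before combining the skeleton and visual streams. The paper's own implementation is not shown here; the following is only a minimal numpy sketch of that idea, where the scoring vectors `w_frame` and `w_joint` stand in for learned parameters and all shapes and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def salient_fusion(skel_feats, vis_feats, w_frame, w_joint):
    """Toy salient fusion: attention scores select informative
    joints and frames before the two streams are concatenated.

    skel_feats: (T, J, D) per-frame, per-joint skeleton features
    vis_feats:  (T, D)    per-frame visual features
    w_frame:    (D,)      learned frame-scoring vector (assumption)
    w_joint:    (D,)      learned joint-scoring vector (assumption)
    """
    joint_attn = softmax(skel_feats @ w_joint, axis=1)        # (T, J)
    skel_frame = (joint_attn[..., None] * skel_feats).sum(1)  # (T, D)
    frame_attn = softmax(skel_frame @ w_frame, axis=0)        # (T,)
    skel_vec = (frame_attn[:, None] * skel_frame).sum(0)      # (D,)
    vis_attn = softmax(vis_feats @ w_frame, axis=0)           # (T,)
    vis_vec = (vis_attn[:, None] * vis_feats).sum(0)          # (D,)
    return np.concatenate([skel_vec, vis_vec])                # (2D,)

rng = np.random.default_rng(0)
T, J, D = 8, 25, 16  # 25 joints, as in NTU-RGB+D skeletons
fused = salient_fusion(rng.normal(size=(T, J, D)),
                       rng.normal(size=(T, D)),
                       rng.normal(size=D), rng.normal(size=D))
```

In the paper this role is played by attention and transformer mechanisms over the high-dimensional modalities; the sketch only illustrates the weighting-then-fusing structure.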
Problem

Research questions and friction points this paper is trying to address.

Single-modality human action recognition lacks the accuracy autonomous robots and vehicles need; integrating modalities is required.
Fusing high-dimensional skeleton and visual features risks diluting the informative frames and joints.
No multimodal action dataset existed for instructing autonomous machines on real-world construction sites.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A language model with learnable, skeleton-conditioned prompts guides feature extraction in the skeleton encoder.
A salient fusion module with attention and transformer mechanisms combines the dual-modality features.
VolvoConstAct: a new visual, skeleton, and depth dataset for construction-site applications.
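The first innovation, language supervision via learnable prompts, can be sketched in the style of prompt-tuning methods such as CoOp: learnable context vectors are combined with class-name token embeddings, and the resulting prompt embedding is matched to the skeleton feature. This is an illustrative assumption, not the paper's code; the mean-pooling stands in for a real text encoder, and all names and shapes are hypothetical.

```python
import numpy as np

def prompt_scores(skel_embed, prompt_ctx, class_tokens):
    """Prompt-tuning sketch: learnable context vectors are combined
    with each class-name token embedding, and the pooled prompt is
    matched to the skeleton feature by cosine similarity.

    skel_embed:   (D,)    skeleton feature from the encoder
    prompt_ctx:   (M, D)  learnable context vectors (assumption)
    class_tokens: (C, D)  class-name token embeddings
    """
    sims = []
    for cls in class_tokens:
        # mean pooling is a stand-in for the language model's encoder
        prompt = np.vstack([prompt_ctx, cls[None]]).mean(0)
        cos = prompt @ skel_embed / (
            np.linalg.norm(prompt) * np.linalg.norm(skel_embed) + 1e-8)
        sims.append(cos)
    return np.array(sims)  # (C,) one score per action class

rng = np.random.default_rng(1)
D, M, C = 16, 4, 10
scores = prompt_scores(rng.normal(size=D),
                       rng.normal(size=(M, D)),
                       rng.normal(size=(C, D)))
```

Training would then push the score of the correct action class up (e.g. with a cross-entropy loss over `scores`), which is what lets the language model shape the skeleton encoder's features.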