🤖 AI Summary
Human activity recognition (HAR) using wearable sensors faces challenges including sparse modality information and severe scarcity of labeled data. Method: We propose MuJo, a multimodal joint feature space learning framework that introduces the first cross-modal alignment pretraining paradigm integrating video, natural language, human pose, and synthetic IMU signals, leveraging contrastive learning, Transformer-based encoders, and self-supervised pretraining. We further construct FiMAD, the first open-source aligned multimodal dataset tailored to the fitness domain. Results: On MM-Fit, MuJo achieves a Macro F1 score of 0.855 using only 2% of the labeled data and 0.942 with full supervision. It significantly outperforms existing self-supervised methods on real-world benchmarks, including MyoGym, MotionSense, and MHEALTH, demonstrating substantial improvements in both data efficiency and cross-dataset generalization.
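The summary does not spell out the contrastive objective; a common way to learn such a joint feature space is a CLIP-style symmetric InfoNCE loss, where time-aligned pairs of embeddings from two modalities (e.g., pose and simulated IMU) are pulled together and all other pairings in the batch are pushed apart. A minimal NumPy sketch of that standard objective (the function names and the temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy over rows of a logits matrix (log-sum-exp stabilized)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def symmetric_infonce(z_a, z_b, temperature=0.07):
    """CLIP-style contrastive loss between two modalities.

    Row i of z_a and row i of z_b come from the same time window (a positive
    pair); every other row pairing in the batch serves as a negative.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = (z_a @ z_b.T) / temperature      # (B, B) similarity matrix
    labels = np.arange(len(z_a))              # positives lie on the diagonal
    # Classify a->b and b->a, then average the two directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

In a full framework, `z_a` and `z_b` would be the outputs of per-modality Transformer encoders, and the loss would be minimized jointly over all modality pairs; the sketch shows only the alignment term for one pair.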
📝 Abstract
Human activity recognition (HAR) is a long-standing problem in artificial intelligence with applications in a broad range of areas, including healthcare, sports and fitness, security, and more. The performance of HAR in real-world settings depends strongly on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, particularly in conjunction with foundation models, can today distinguish complex activities fairly reliably. Recognition using modalities such as wearable sensors (which are often more broadly available, e.g., in mobile phones and smartwatches), on the other hand, is a more difficult problem, as the signals often contain less information and labeled training data is harder to acquire. To alleviate the need for labeled data, we introduce in this work our comprehensive Fitness Multimodal Activity Dataset (FiMAD), which can be used with the proposed pre-training method MuJo (Multimodal Joint Feature Space Learning) to enhance HAR performance across various modalities. FiMAD was created from YouTube fitness videos and contains parallel video, language, pose, and simulated IMU sensor data; MuJo uses this dataset to learn a joint feature space for these modalities. We show that classifiers pre-trained on FiMAD improve performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH. For instance, on MM-Fit, we achieve a Macro F1-Score of up to 0.855 when fine-tuning on only 2% of the training data and 0.942 when utilizing the complete training set. We compare our approach with other self-supervised methods and show that, unlike them, ours consistently improves over the baseline network's performance while also providing better data efficiency.
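The reported metric, Macro F1, averages the per-class F1 scores with equal weight, so rare activities count as much as frequent ones; this matters for HAR datasets with imbalanced activity distributions. A minimal reference implementation (equivalent in spirit to scikit-learn's `f1_score(..., average='macro')`):

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes present."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    f1_scores = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))   # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))   # false positives
        fn = np.sum((y_pred != c) & (y_true == c))   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return float(np.mean(f1_scores))
```

Because each class contributes equally, a classifier that ignores minority activities is penalized even if its overall accuracy is high, which is why the metric is a common choice for HAR evaluation.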