🤖 AI Summary
To address poor interpretability and weak cross-dataset generalization in wearable-sensor-based human activity recognition (HAR), this paper proposes the Motion-Primitive Transformer (MoPFormer). Methodologically, MoPFormer introduces, for the first time, a discrete representation of motion primitives—semantically meaningful and physically interpretable elementary motion units—derived from IMU signals. It integrates context-aware embedding with masked motion modeling pretraining within a Transformer architecture to enable primitive-level temporal dependency learning and self-supervised reconstruction. Evaluated on six HAR benchmarks, MoPFormer achieves state-of-the-art performance, with substantial improvements in cross-device and cross-scenario accuracy. Ablation studies and visualization analyses confirm that motion primitives exhibit high stability and transferability, simultaneously enhancing model interpretability and generalization robustness.
📝 Abstract
Human Activity Recognition (HAR) with wearable sensors is challenged by limited interpretability, which significantly impacts cross-dataset generalization. To address this challenge, we propose the Motion-Primitive Transformer (MoPFormer), a novel self-supervised framework that enhances interpretability by tokenizing inertial measurement unit (IMU) signals into semantically meaningful motion primitives and leverages a Transformer architecture to learn rich temporal representations. MoPFormer comprises two stages: the first partitions multi-channel sensor streams into short segments and quantizes them into discrete "motion primitive" codewords, while the second enriches the tokenized sequences through a context-aware embedding module and then processes them with a Transformer encoder. MoPFormer can be pre-trained with a masked motion-modeling objective that reconstructs missing primitives, enabling it to develop robust representations across diverse sensor configurations. Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also generalizes successfully across multiple datasets. Most importantly, the learned motion primitives significantly enhance both interpretability and cross-dataset performance by capturing fundamental movement patterns that remain consistent across similar activities regardless of dataset origin.
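The first-stage tokenization described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the window length, codebook size, flattening scheme, and the `tokenize_imu` helper are all assumptions made for the example; the paper does not specify how its codebook is learned or how segments are matched to codewords.

```python
import numpy as np

def tokenize_imu(stream: np.ndarray, codebook: np.ndarray, win: int = 16) -> np.ndarray:
    """Quantize an IMU stream into discrete motion-primitive tokens (illustrative).

    stream:   (T, C) multi-channel sensor signal, e.g. C=6 for accel + gyro.
    codebook: (K, win*C) centroids, one per hypothetical motion primitive.
    Returns one codeword index per non-overlapping window of length `win`.
    """
    T, C = stream.shape
    n = T // win
    # Partition the stream into short segments and flatten each window.
    segs = stream[: n * win].reshape(n, win * C)
    # Nearest-centroid lookup: squared distance from each segment to each codeword.
    dists = ((segs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
stream = rng.normal(size=(128, 6))        # synthetic 6-axis IMU signal
codebook = rng.normal(size=(32, 16 * 6))  # K=32 primitives (assumed size)
tokens = tokenize_imu(stream, codebook)
print(tokens.shape)  # one primitive token per window: (8,)
```

The resulting token sequence is what the second stage would consume: embed the discrete tokens, add context-aware information, and feed them to a Transformer encoder, with pre-training done by masking a subset of tokens and predicting the missing codeword indices.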