🤖 AI Summary
Existing wearable-based human activity recognition methods rely on predefined closed-set categories, limiting their ability to handle open-ended, personalized, and combinatorial activities encountered in real-world scenarios. This work proposes a novel open-domain paradigm centered on natural language activity narratives. By constructing multimodal data from multi-position wearable sensors paired with temporally aligned free-text descriptions, the authors design a language-conditioned neural architecture and introduce a retrieval-based evaluation protocol that does not require fixed activity categories. The resulting framework unifies open-vocabulary understanding with traditional closed-set recognition. In cross-participant evaluations, it achieves a Macro-F1 score of 65.3%, substantially outperforming strong baselines (31–34%), thereby demonstrating its effectiveness and robustness.
📝 Abstract
Wearable human activity recognition (HAR) has improved steadily, but most progress still relies on closed-set classification, which limits real-world use. In practice, human activity is open-ended, unscripted, personalized, and often compositional, unfolding as narratives rather than instances of fixed classes. We argue that addressing this gap does not require simply scaling datasets or models. It requires a fundamental shift in how wearable HAR is formulated, supervised, and evaluated. This work shows how to model open-ended activity narratives by aligning wearable sensor data with natural-language descriptions in an open-vocabulary setting. Our framework has three core components. First, we introduce a naturalistic data collection and annotation pipeline that combines multi-position wearable sensing with free-form, time-aligned narrative descriptions of ongoing behavior, allowing activity semantics to emerge without a predefined vocabulary. Second, we define a retrieval-based evaluation framework that measures semantic alignment between sensor data and language, enabling principled evaluation without fixed classes while also subsuming closed-set classification as a special case. Third, we present a language-conditioned learning architecture that supports sensor-to-text inference over variable-length sensor streams and heterogeneous sensor placements. Experiments show that models trained with fixed-label objectives degrade sharply under real-world variability, while open-vocabulary sensor-language alignment yields robust and semantically grounded representations. Once this alignment is learned, closed-set activity recognition becomes a simple downstream task. Under cross-participant evaluation, our method achieves 65.3% Macro-F1, compared with 31–34% for strong closed-set HAR baselines. These results establish open-ended narrative modeling as a practical and effective foundation for real-world wearable HAR.
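The retrieval-based evaluation described above can be illustrated with a minimal sketch: embed a sensor window and a pool of candidate text descriptions in a shared space, rank candidates by cosine similarity, and treat closed-set recognition as the special case where the candidate pool is the fixed label set. The encoders, candidate phrases, and embedding dimension below are all hypothetical stand-ins, not the paper's actual model.

```python
import numpy as np

def retrieve(sensor_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Rank candidate text embeddings by cosine similarity to a sensor embedding."""
    s = sensor_emb / np.linalg.norm(sensor_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ s                       # cosine similarity per candidate
    return np.argsort(-sims)           # candidate indices, best first

# Toy shared embedding space (assumption: a trained sensor encoder and text
# encoder already map both modalities into the same 8-dim space).
rng = np.random.default_rng(0)
candidate_texts = ["walking upstairs", "typing at a desk", "stirring a pot"]
text_embs = rng.normal(size=(3, 8))

# Simulate a sensor embedding that lies near candidate 1 ("typing at a desk").
sensor_emb = text_embs[1] + 0.01 * rng.normal(size=8)

ranking = retrieve(sensor_emb, text_embs)
print(candidate_texts[ranking[0]])     # top-ranked description
```

When the candidate pool is an open set of free-text narratives, ranking quality can be scored with retrieval metrics such as Recall@k; when it is a fixed label vocabulary, the top-ranked candidate is simply the predicted class, which is how closed-set classification falls out as a special case.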