🤖 AI Summary
This work addresses privacy concerns and environmental dependencies in vision-based human activity recognition for smart homes by proposing a camera-free approach that directly generates natural language descriptions of activities from heterogeneous signals such as wearable IMU and Wi-Fi. The method employs a unified sensor encoder to extract shared motion dynamics, combining patch-based modeling of local temporal correlations with heterogeneous placement embeddings to construct a cohesive signal representation. An autoregressive Transformer decoder then produces open-ended, human-readable activity narratives, circumventing the limitations of predefined activity labels. Evaluated on multiple datasets, including XRF V2, UWash, and WiFiTAD, the approach achieves state-of-the-art performance on metrics such as BLEU@4, CIDEr, and RMC, significantly outperforming existing baselines.
📝 Abstract
Human Activity Recognition (HAR) in smart homes is critical for health monitoring and assistive living. While vision-based systems are common, they face privacy concerns and environmental limitations (e.g., occlusion). In this work, we present MobiDiary, a framework that generates natural language descriptions of daily activities directly from heterogeneous physical signals (specifically IMU and Wi-Fi). Unlike conventional approaches that restrict outputs to predefined labels, MobiDiary produces expressive, human-readable summaries. To bridge the semantic gap between continuous, noisy physical signals and discrete linguistic descriptions, we propose a unified sensor encoder. Instead of relying on modality-specific engineering, we exploit the shared inductive biases of motion-induced signals, where both inertial and wireless data reflect underlying kinematic dynamics. Specifically, our encoder utilizes a patch-based mechanism to capture local temporal correlations and integrates heterogeneous placement embeddings to unify spatial contexts across different sensors. These unified signal tokens are then fed into a Transformer-based decoder, which employs an autoregressive mechanism to generate coherent action descriptions word-by-word. We comprehensively evaluate our approach on multiple public benchmarks (XRF V2, UWash, and WiFiTAD). Experimental results demonstrate that MobiDiary effectively generalizes across modalities, achieving state-of-the-art performance on captioning metrics (e.g., BLEU@4, CIDEr, RMC) and outperforming specialized baselines in continuous action understanding.
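To make the encoder's front end concrete, the sketch below illustrates the general idea of patch-based tokenization with per-sensor placement embeddings. All shapes, names, and the random projections are illustrative assumptions, not the paper's actual architecture or learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs (shapes assumed, not from the paper): a 2 s window of
# 6-channel IMU data at 50 Hz, and a Wi-Fi window with 30 subcarrier channels.
imu = rng.standard_normal((100, 6))    # (time, channels)
wifi = rng.standard_normal((100, 30))

PATCH = 10  # assumed patch length in samples
D = 32      # assumed shared token dimension

def patchify(x, patch_len):
    """Split a (T, C) signal into non-overlapping temporal patches,
    flattening each patch into one vector: (T // patch_len, patch_len * C)."""
    t, c = x.shape
    n = t // patch_len
    return x[: n * patch_len].reshape(n, patch_len * c)

def project(patches, d):
    """Project patches into the shared d-dim token space.
    A trained model would learn these weights; random here for illustration."""
    w = rng.standard_normal((patches.shape[1], d)) / np.sqrt(patches.shape[1])
    return patches @ w

# Per-modality projections map both sensors into one token space.
imu_tokens = project(patchify(imu, PATCH), D)
wifi_tokens = project(patchify(wifi, PATCH), D)

# Placement embeddings (randomly initialized stand-ins) tag each token with
# its sensor context, so the decoder consumes a single unified sequence.
placement = {
    "imu_wrist": rng.standard_normal(D),
    "wifi_link": rng.standard_normal(D),
}
tokens = np.concatenate([
    imu_tokens + placement["imu_wrist"],
    wifi_tokens + placement["wifi_link"],
])

print(tokens.shape)  # (20, 32): 10 IMU patches + 10 Wi-Fi patches
```

In the full system, a sequence like `tokens` would be attended over by the autoregressive Transformer decoder, which emits the activity description one word at a time conditioned on previously generated words.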