MobiDiary: Autoregressive Action Captioning with Wearable Devices and Wireless Signals

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses privacy concerns and environmental dependencies in vision-based human activity recognition for smart homes by proposing a camera-free approach that directly generates natural language descriptions of activities from heterogeneous signals such as wearable IMU and Wi-Fi. The method employs a unified sensor encoder to extract shared motion dynamics, integrating local temporal correlations and heterogeneous positional embeddings to construct a cohesive signal representation. An autoregressive Transformer decoder then produces open-ended, human-readable activity narratives, circumventing the limitations of predefined activity labels. Evaluated on multiple datasets—including XRF V2, UWash, and WiFiTAD—the approach achieves state-of-the-art performance, significantly outperforming existing baselines and demonstrating superior results on metrics such as BLEU@4, CIDEr, and RMC.
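The summary's "unified sensor encoder" can be pictured as patch-based tokenization: a windowed sensor stream is cut into short temporal patches, each patch is flattened and linearly projected, and a per-sensor placement embedding is added so tokens from different devices share one space. The sketch below is illustrative only; all shapes (a 100-sample, 6-channel IMU window, patch length 10, model width 64) and the `patch_tokenize` helper are assumptions, not the paper's actual configuration.

```python
import numpy as np

def patch_tokenize(signal, patch_len, proj, placement_emb):
    """Split a (T, C) signal into non-overlapping temporal patches,
    flatten each patch, project it linearly, and add a per-sensor
    placement embedding (broadcast over all patches)."""
    T, C = signal.shape
    n_patches = T // patch_len
    patches = signal[: n_patches * patch_len].reshape(n_patches, patch_len * C)
    tokens = patches @ proj            # (n_patches, d_model)
    return tokens + placement_emb      # unify spatial context across sensors

rng = np.random.default_rng(0)
imu = rng.normal(size=(100, 6))        # hypothetical 2 s IMU window, 6 channels
proj = rng.normal(size=(10 * 6, 64))   # patch_len * C -> d_model projection
placement = rng.normal(size=(64,))     # embedding for, e.g., a wrist-worn IMU

tokens = patch_tokenize(imu, patch_len=10, proj=proj, placement_emb=placement)
print(tokens.shape)  # (10, 64): ten unified signal tokens
```

A Wi-Fi CSI window of a different length and channel count would pass through the same function with its own projection and placement embedding, which is the sense in which the tokens are "unified".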

📝 Abstract
Human Activity Recognition (HAR) in smart homes is critical for health monitoring and assistive living. While vision-based systems are common, they face privacy concerns and environmental limitations (e.g., occlusion). In this work, we present MobiDiary, a framework that generates natural language descriptions of daily activities directly from heterogeneous physical signals (specifically IMU and Wi-Fi). Unlike conventional approaches that restrict outputs to pre-defined labels, MobiDiary produces expressive, human-readable summaries. To bridge the semantic gap between continuous, noisy physical signals and discrete linguistic descriptions, we propose a unified sensor encoder. Instead of relying on modality-specific engineering, we exploit the shared inductive biases of motion-induced signals, where both inertial and wireless data reflect the underlying kinematic dynamics. Specifically, our encoder utilizes a patch-based mechanism to capture local temporal correlations and integrates heterogeneous placement embeddings to unify spatial contexts across different sensors. These unified signal tokens are then fed into a Transformer-based decoder, which employs an autoregressive mechanism to generate coherent action descriptions word by word. We comprehensively evaluate our approach on multiple public benchmarks (XRF V2, UWash, and WiFiTAD). Experimental results demonstrate that MobiDiary generalizes effectively across modalities, achieving state-of-the-art performance on captioning metrics (e.g., BLEU@4, CIDEr, RMC) and outperforming specialized baselines in continuous action understanding.
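The "word-by-word" generation the abstract describes is the standard autoregressive loop: the decoder repeatedly consumes the caption prefix (conditioned on the signal tokens) and emits the next word until an end token appears. The toy below shows only that control flow; `step_fn` is a hypothetical stand-in for the conditioned Transformer decoder, and the scripted caption is invented for illustration.

```python
def generate(step_fn, bos="<bos>", eos="<eos>", max_len=20):
    """Greedy autoregressive decoding: feed the growing prefix to
    step_fn, which returns the most likely next word, until EOS."""
    words = [bos]
    for _ in range(max_len):
        nxt = step_fn(words)
        if nxt == eos:
            break
        words.append(nxt)
    return " ".join(words[1:])

# Toy stand-in for a decoder conditioned on unified signal tokens:
script = ["the", "user", "washes", "dishes", "<eos>"]
caption = generate(lambda prefix: script[len(prefix) - 1])
print(caption)  # "the user washes dishes"
```

In practice the decoder would use beam search or sampling rather than the greedy choice shown here, but the prefix-in, next-word-out structure is the same.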
Problem

Research questions and friction points this paper is trying to address.

Human Activity Recognition
Action Captioning
Wearable Sensors
Wi-Fi Sensing
Natural Language Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive Captioning
Unified Sensor Encoder
Heterogeneous Signal Fusion
Motion-Induced Inductive Bias
Wearable and Wi-Fi Sensing
Fei Deng
Research Scientist, Google
Diffusion Models · RLHF · Reinforcement Learning · Generative Models · Object-Centric Learning
Yinghui He
PhD student, Princeton University
Natural Language Processing
Chuntong Chu
School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China
Ge Wang
Associate Professor of Music (also Computer Science), Stanford University
Artful Design · Computer Music · Interaction Design · Laptop Orchestra · Music Programming Language Design
Han Ding
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
Jinsong Han
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Fei Wang
Xi'an Jiaotong University
Computer Vision · Artificial Intelligence