🤖 AI Summary
Large language models (LLMs) struggle to process raw motion sensor time-series data due to semantic sparsity, numerical input incompatibility, and computational constraints. To address this, we propose SensorLLM, a two-stage sensor-to-language alignment framework. Its core contributions are: (1) channel-specific special tokens coupled with auto-generated trend-oriented textual descriptions, enabling semantic encoding of multichannel, variable-length numeric sequences; and (2) an integrated pipeline combining textualized sequence representation, special token embedding, instruction tuning, and task-aware LoRA adaptation for human activity recognition (HAR) classification. Evaluated across multiple benchmarks, SensorLLM matches or surpasses state-of-the-art performance, demonstrating high accuracy, cross-device transferability, and strong generalization capability.
📝 Abstract
We introduce SensorLLM, a two-stage framework that enables Large Language Models (LLMs) to perform human activity recognition (HAR) from sensor data. Despite their strong reasoning and generalization capabilities, LLMs remain underutilized for motion sensor data due to the lack of semantic context in time series, computational constraints, and challenges in processing numerical inputs. SensorLLM addresses these limitations through a Sensor-Language Alignment stage, where we introduce special tokens for each sensor channel and automatically generate textual trend descriptions. This alignment enables LLMs to capture numerical variations, channel-specific features, and data of varying duration, all without requiring human annotations. In the subsequent Task-Aware Tuning stage, we refine the model for HAR classification, achieving performance that matches or surpasses state-of-the-art methods. Our results demonstrate that SensorLLM evolves into an effective sensor learner, reasoner, and classifier through Sensor-Language Alignment, generalizing across diverse HAR datasets. We believe this work establishes a foundation for future research on time-series and text alignment, paving the way for foundation models in sensor data analysis.
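To make the Sensor-Language Alignment idea more concrete, the following is a minimal Python sketch of how trend descriptions might be generated automatically for one sensor channel and paired with channel-specific special tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names `describe_trends` and `channel_prompt`, the token format `<x_acc_start>` / `<x_acc_end>`, the tolerance `tol`, and the choice to print raw readings as text are all hypothetical, since the abstract does not specify them.

```python
from typing import List, Tuple


def describe_trends(values: List[float], tol: float = 1e-3) -> List[Tuple[int, int, str]]:
    """Merge consecutive steps that move in the same direction into
    (start_index, end_index, label) segments. `tol` is an assumed threshold
    below which a change counts as "steady"."""
    segments: List[Tuple[int, int, str]] = []
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        if delta > tol:
            label = "increasing"
        elif delta < -tol:
            label = "decreasing"
        else:
            label = "steady"
        if segments and segments[-1][2] == label:
            # Same direction as the previous step: extend the open segment.
            start, _, _ = segments[-1]
            segments[-1] = (start, i, label)
        else:
            segments.append((i - 1, i, label))
    return segments


def channel_prompt(channel: str, values: List[float]) -> str:
    """Wrap one channel's readings in hypothetical special tokens and append
    an automatically generated trend description (no human annotation)."""
    start_tok, end_tok = f"<{channel}_start>", f"<{channel}_end>"
    readings = " ".join(f"{v:.2f}" for v in values)
    trend_text = "; ".join(
        f"steps {s}-{e}: {label}" for s, e, label in describe_trends(values)
    )
    return f"{start_tok} {readings} {end_tok} Trend: {trend_text}"


if __name__ == "__main__":
    # Toy x-axis accelerometer window; works for any window length.
    print(channel_prompt("x_acc", [0.0, 0.5, 1.0, 1.0, 0.2]))
```

Because the description is derived from the signal itself, the same routine applies to windows of any length and to any number of channels, which is the property the abstract attributes to the alignment stage.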