SensorLM: Learning the Language of Wearable Sensors

📅 2025-06-10
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Aligning wearable sensor signals with natural language remains challenging due to semantic heterogeneity and the scarcity of large-scale, high-quality paired sensor–language annotations. Method: We introduce the first multimodal foundation model for wearable sensing. (1) We design a hierarchical captioning pipeline to construct the largest real-world sensor–language dataset to date, comprising 59.7 million hours of sensor data from more than 103,000 individuals, paired with descriptive natural-language captions; (2) we generalize existing contrastive and generative multimodal architectures (e.g., CLIP, CoCa) into a unified, scalable cross-modal alignment pretraining framework. Contribution/Results: The model achieves state-of-the-art performance on human activity recognition and healthcare tasks, enabling zero-shot classification, few-shot transfer, and cross-modal retrieval. It demonstrates label efficiency, strong generalization to unseen tasks, and robustness under limited supervision, establishing a new foundation for language-grounded wearable sensing.
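To make the unified pretraining objective concrete, here is a minimal sketch, assuming a two-tower encoder setup plus a caption decoder, of a combined CLIP-style contrastive loss and CoCa-style captioning loss. This illustrates the general recipe the summary describes, not SensorLM's released code; all names, shapes, and weights are assumptions.

```python
# Minimal sketch (not the released SensorLM code) of a unified
# contrastive + generative objective: CLIP-style InfoNCE plus a
# CoCa-style captioning loss. Shapes and names are illustrative.
import torch
import torch.nn.functional as F

def unified_pretraining_loss(sensor_emb, text_emb, caption_logits,
                             caption_targets, temperature=0.07,
                             contrastive_weight=1.0, caption_weight=1.0):
    """sensor_emb, text_emb: (B, D) pooled embeddings from the two towers.
    caption_logits: (B, T, V) decoder outputs; caption_targets: (B, T) token ids."""
    # CLIP-style symmetric contrastive loss over in-batch sensor-text pairs.
    sensor_emb = F.normalize(sensor_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = sensor_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    # CoCa-style generative loss: next-token prediction of the caption,
    # conditioned on sensor features inside the decoder (not shown).
    generative = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1))
    return contrastive_weight * contrastive + caption_weight * generative
```

Note how prior architectures fall out as special cases of this generic objective: setting caption_weight to zero recovers a CLIP-style contrastive model, while dropping the contrastive term leaves a pure captioner.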

📝 Abstract
We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite the pervasiveness of wearable sensor data, aligning and interpreting it with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa), recovering them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art baselines in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities, including favorable scaling behavior, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks.
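The hierarchical pipeline is described only at a high level here, but a toy version of its three tiers (statistical aggregates, structural trends and events, semantic interpretation) might look like the following sketch. All thresholds, templates, and signal choices below are illustrative assumptions, not the paper's actual rules.

```python
# Illustrative sketch of a hierarchical captioning pipeline in the spirit
# of the abstract (statistical -> structural -> semantic). Thresholds and
# caption templates are assumptions for demonstration only.
import numpy as np

def caption_sensor_window(heart_rate, steps, activity_label=None):
    """heart_rate, steps: 1-D arrays for one time window (e.g., minute-level)."""
    # Tier 1 -- statistical: simple aggregates of the raw signals.
    stat = (f"Mean heart rate {heart_rate.mean():.0f} bpm "
            f"(range {heart_rate.min():.0f}-{heart_rate.max():.0f}); "
            f"{int(steps.sum())} steps total.")
    # Tier 2 -- structural: trends and salient events within the window.
    trend = ("rising" if heart_rate[-1] > heart_rate[0] + 5
             else "falling" if heart_rate[-1] < heart_rate[0] - 5
             else "stable")
    peak = int(np.argmax(heart_rate))
    struct = (f"Heart rate is {trend}, peaking at minute {peak}; "
              f"step activity is {'sustained' if (steps > 0).mean() > 0.5 else 'sparse'}.")
    # Tier 3 -- semantic: high-level interpretation, e.g., from logged labels.
    sem = f"The wearer is likely {activity_label}." if activity_label else ""
    return " ".join(filter(None, [stat, struct, sem]))

# Example: a 30-minute window resembling a light jog.
hr = np.linspace(80, 140, 30) + np.random.randn(30)
st = np.full(30, 120)
print(caption_sensor_window(hr, st, activity_label="running"))
```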
Problem

Research questions and friction points this paper is trying to address.

Scarcity of paired, richly annotated sensor–language data for aligning wearable signals with text
Generating captions that capture the statistical, structural, and semantic content of sensor data
Improving zero-shot recognition, few-shot learning, and cross-modal retrieval on wearable sensor tasks (a zero-shot sketch follows this list)
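As a concrete illustration of the recognition-and-retrieval setting, below is a minimal sketch of CLIP-style zero-shot classification with an aligned sensor–text model. `sensor_encoder`, `text_encoder`, and the prompt template are hypothetical stand-ins, not SensorLM's actual API.

```python
# Minimal sketch of zero-shot recognition with a contrastively aligned
# sensor-text model: score a sensor window against caption templates for
# each class and pick the best match. The encoders are hypothetical
# stand-ins for pretrained sensor/text towers.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(sensor_window, class_names, sensor_encoder, text_encoder):
    # Embed natural-language prompts describing each candidate activity.
    prompts = [f"A wearable sensor recording of a person {name}." for name in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)            # (C, D)
    # Embed the unlabeled sensor window with the sensor tower.
    sensor_emb = F.normalize(sensor_encoder(sensor_window), dim=-1)  # (1, D)
    # Cosine similarity against every class prompt; highest score wins.
    scores = (sensor_emb @ text_emb.t()).squeeze(0)                  # (C,)
    return class_names[scores.argmax().item()], scores.softmax(dim=-1)
```

Cross-modal retrieval uses the same similarity scores, ranking captions for a given sensor window (or windows for a given text query) instead of taking the argmax.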
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical caption generation pipeline for sensor data
Unified extension of contrastive and generative multimodal pretraining architectures (e.g., CLIP, CoCa)
Curation of the largest sensor–language dataset to date (59.7 million hours from 103,000+ people)
👥 Authors

Yuwei Zhang (Google Research)
Kumar Ayush (Google | Stanford University | Indian Institute of Technology Kharagpur) · Foundation Models, Large Language Models, Generative AI, RLHF
Siyuan Qiao (Google DeepMind)
A. Ali Heydari (Google Research)
Girish Narayanswamy (UbiComp Lab, University of Washington) · Health Sensing, Signal Processing, Machine Learning, Artificial Intelligence, Embedded Systems
Maxwell A. Xu (Google Research)
Ahmed A. Metwally (Google Research)
Shawn Xu (Google LLC) · Machine Learning, Computer Vision, Artificial Intelligence
Jake Garrison (Google Research)
Xuhai Xu (Assistant Professor, Columbia University | Google) · Human-Computer Interaction, Ubiquitous Computing, Human-Centered AI, mHealth, Health Informatics
Tim Althoff (Associate Professor of Computer Science, University of Washington) · Human AI Interaction, Natural Language Processing, Behavioral Data Science, AI for Mental Health
Yun Liu (Google Research)
Pushmeet Kohli (DeepMind) · AI for Science, Machine Learning, AI Safety, Computer Vision, Program Synthesis
Jiening Zhan (Google Research)
Mark Malhotra (Google Research)
Shwetak Patel (University of Washington, Washington Research Foundation Endowed Professor, Computer Science) · Ubiquitous Computing, Human-Computer Interaction, Sensors, Embedded Systems
Cecilia Mascolo (University of Cambridge) · Mobile Systems, Mobile Health, Wearable Data Machine Learning, On Device Machine Learning
Xin Liu (Google Research)
Daniel McDuff (Google and University of Washington) · Affective Computing, Deep Learning, Human-Computer Interaction, Human-Centered AI, Computer Vision
Yuzhe Yang (Google Research)