Learning Transferable Sensor Models via Language-Informed Pretraining

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of existing self-supervised sensor representation methods, which struggle to support downstream semantic tasks. To overcome this, the authors propose SLIP, a framework that leverages language-guided pretraining through contrastive alignment and sensor-conditioned caption generation to produce transferable representations across diverse sensor configurations. SLIP integrates a pretrained decoder-only language model, cross-attention mechanisms, and a flexible patch embedder, enabling it to handle variable-length, multi-resolution inputs without fine-tuning. The model performs zero-shot cross-domain transfer and question answering out of the box. Evaluated on 11 datasets, SLIP attains an average linear-probe accuracy of 77.14%, a 5.93% relative improvement over strong baselines, and achieves 64.83% accuracy on sensor-based question answering.

Technology Category

Application Category

📝 Abstract
Modern sensing systems generate large volumes of unlabeled multivariate time-series data. This abundance of unlabeled data makes self-supervised learning (SSL) a natural approach for learning transferable representations. However, most existing approaches are optimized for reconstruction or forecasting objectives and often fail to capture the semantic structure required for downstream classification and reasoning tasks. While recent sensor-language alignment methods improve semantic generalization through captioning and zero-shot transfer, they are limited to fixed sensor configurations, such as predefined channel sets, signal lengths, or temporal resolutions, which hinders cross-domain applicability. To address these gaps, we introduce SLIP (Sensor Language-Informed Pretraining), an open-source framework for learning language-aligned representations that generalize across diverse sensor setups. SLIP integrates contrastive alignment with sensor-conditioned captioning, facilitating both discriminative understanding and generative reasoning. By repurposing a pretrained decoder-only language model via cross-attention and introducing a flexible patch embedder, SLIP supports different temporal resolutions and variable-length input at inference time without additional retraining. Across 11 datasets, SLIP demonstrates superior performance in zero-shot transfer, signal captioning, and question answering. It achieves a 77.14% average linear-probing accuracy, a 5.93% relative improvement over strong baselines, and reaches 64.83% accuracy in sensor-based question answering.
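The two core ideas in the abstract, a patch embedder that turns a variable-length multivariate signal into a variable number of tokens, and a contrastive loss aligning pooled sensor embeddings with caption embeddings, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all names, dimensions, and the random linear projection are illustrative stand-ins for the learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH_LEN, D_MODEL = 16, 32
# Stand-in for the learned patch projection (shared across inputs).
W = rng.normal(scale=PATCH_LEN ** -0.5, size=(PATCH_LEN, D_MODEL))

def patch_embed(x):
    """Flexible patch embedder sketch: split a (channels, time) signal into
    fixed-size patches and project each to D_MODEL, so inputs of different
    lengths simply yield different numbers of tokens."""
    c, t = x.shape
    n = t // PATCH_LEN                                  # patches per channel
    patches = x[:, :n * PATCH_LEN].reshape(c * n, PATCH_LEN)
    return patches @ W                                  # (c * n, D_MODEL)

def info_nce(s, t, tau=0.07):
    """Symmetric InfoNCE loss between sensor and text embeddings, with
    matched rows treated as positive pairs (a common contrastive recipe)."""
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = s @ t.T / tau

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_p).mean()                   # diagonal = positives

    return 0.5 * (xent(logits) + xent(logits.T))

# Two recordings with different lengths produce different token counts,
# illustrating variable-length support without retraining.
a = patch_embed(rng.normal(size=(3, 128)))              # 3 ch x 128 steps -> 24 tokens
b = patch_embed(rng.normal(size=(3, 256)))              # 3 ch x 256 steps -> 48 tokens

sensor = np.stack([a.mean(axis=0), b.mean(axis=0)])     # mean-pool tokens per signal
text = rng.normal(size=(2, D_MODEL))                    # stand-in caption embeddings
loss = info_nce(sensor, text)
```

In the actual framework the projection and text encoder are learned and the language model attends to the sensor tokens via cross-attention; the sketch only shows why patching makes the token count, rather than the model, absorb changes in signal length.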
Problem

Research questions and friction points this paper is trying to address.

sensor models
self-supervised learning
semantic generalization
cross-domain applicability
time-series data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sensor-Language Alignment
Self-Supervised Learning
Zero-Shot Transfer
Flexible Patch Embedding
Cross-Attention
Yuliang Chen
University of California, San Diego
Self-Supervised Learning, Multimodal Learning
Arvind Pillai
Dartmouth College
Yu Yvonne Wu
Dartmouth College
Tess Z. Griffin
Dartmouth College
Lisa Marsch
Dartmouth College
Michael V. Heinz
Dartmouth College
Nicholas C. Jacobson
Dartmouth College
Digital Phenotyping, Digital Interventions, Artificial Intelligence, Mental Health, Chatbots
Andrew Campbell
Dartmouth College