📝 Abstract
Both speech and wearable-sensor time series encode information in the time and frequency domains, such as spectral power and waveform shapelets. We show that speech foundation models learn domain-independent representations that achieve state-of-the-art performance on time series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform probes trained on features from self-supervised models trained directly on modality-specific datasets, across mood classification, arrhythmia detection, and activity classification tasks. We find the convolutional feature encoders of speech models to be particularly relevant for wearable sensor tasks. The methods proposed here improve performance and robustness on data-scarce time series tasks using simple probing methods. This work is a step towards generalized time series models spanning speech and sensor data, a topic for further exploration.
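The probing recipe described above — freeze a speech model's convolutional feature encoder, pool its outputs over time, and fit a lightweight linear classifier on a small labeled set — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the single random conv layer stands in for the pretrained HuBERT/wav2vec 2.0 encoder (whose real weights would be loaded via e.g. torchaudio or HuggingFace Transformers), and the toy two-class "sensor" task, ridge-regression probe, and all shapes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_encode(x, w, stride=5):
    """Frozen 1-D conv 'encoder': valid convolution + ReLU.
    x: raw signal (T,); w: filter bank (C, K) -> features (C, T_out)."""
    K = w.shape[1]
    T_out = (len(x) - K) // stride + 1
    windows = np.stack([x[i * stride : i * stride + K] for i in range(T_out)])
    return np.maximum(windows @ w.T, 0.0).T  # (C, T_out)

def extract_features(x, w):
    """Mean-pool encoder outputs over time into one feature vector (C,)."""
    return conv_encode(x, w).mean(axis=1)

# Random filter bank standing in for pretrained speech-encoder weights.
C, K = 16, 25
w = rng.normal(size=(C, K)) / np.sqrt(K)

# Toy binary sensor task: low- vs high-frequency oscillation + noise.
def make_signal(label, T=400):
    freq = 0.05 if label == 0 else 0.25
    t = np.arange(T)
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.normal(size=T)

y = np.array([i % 2 for i in range(80)])
X = np.stack([extract_features(make_signal(lbl), w) for lbl in y])

# Linear probe: ridge regression on +/-1 targets; sign gives the class.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(C), X.T @ (2 * y - 1))
acc = ((X @ W > 0).astype(int) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

Even with random filters, the pooled conv responses separate the two frequency classes, which is the intuition behind reusing speech encoders: their time-frequency filter banks transfer across signal modalities. In the actual setup, only the probe is trained; the encoder stays frozen.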