📝 Abstract
Both speech and wearable-sensor time series encode information in the time and frequency domains, such as spectral power and waveform shapelets. We show that speech foundation models learn domain-independent representations that achieve state-of-the-art performance on time series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform probes trained on features from self-supervised models trained directly on modality-specific datasets, across mood classification, arrhythmia detection, and activity classification tasks. We find the convolutional feature encoders of speech models to be particularly relevant for wearable sensor tasks. The methods proposed here improve performance and robustness on data-scarce time series tasks using simple probing methods. This work is a step towards generalized time series models spanning speech and sensor data, a topic for further exploration.
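The probing recipe described above — freeze a speech model's convolutional feature encoder, pool its outputs over time, and fit a lightweight linear classifier on a small labeled set — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the single random conv layer stands in for the pretrained HuBERT/wav2vec 2.0 encoder (whose real weights would be loaded via e.g. torchaudio or HuggingFace Transformers), and the toy two-class "sensor" task, ridge-regression probe, and all shapes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_encode(x, w, stride=5):
    """Frozen 1-D conv 'encoder': valid convolution + ReLU.
    x: raw signal (T,); w: filter bank (C, K) -> features (C, T_out)."""
    K = w.shape[1]
    T_out = (len(x) - K) // stride + 1
    windows = np.stack([x[i * stride : i * stride + K] for i in range(T_out)])
    return np.maximum(windows @ w.T, 0.0).T  # (C, T_out)

def extract_features(x, w):
    """Mean-pool encoder outputs over time into one feature vector (C,)."""
    return conv_encode(x, w).mean(axis=1)

# Random filter bank standing in for pretrained speech-encoder weights.
C, K = 16, 25
w = rng.normal(size=(C, K)) / np.sqrt(K)

# Toy binary sensor task: low- vs high-frequency oscillation + noise.
def make_signal(label, T=400):
    freq = 0.05 if label == 0 else 0.25
    t = np.arange(T)
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.normal(size=T)

y = np.array([i % 2 for i in range(80)])
X = np.stack([extract_features(make_signal(lbl), w) for lbl in y])

# Linear probe: ridge regression on +/-1 targets; sign gives the class.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(C), X.T @ (2 * y - 1))
acc = ((X @ W > 0).astype(int) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

Even with random filters, the pooled conv responses separate the two frequency classes, which is the intuition behind reusing speech encoders: their time-frequency filter banks transfer across signal modalities. In the actual setup, only the probe is trained; the encoder stays frozen.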