🤖 AI Summary
Speaker-dependent models for voice-based health monitoring face a trade-off between accuracy and deployment efficiency, hindering real-time, scalable applications. Method: This paper applies meta-learning to voice-based health monitoring, proposing a paradigm for individualized, dynamic modelling of vocal fatigue from speech, specifically predicting time elapsed since last sleep. We design three meta-learning architectures of increasing complexity (ensemble-based distance models, prototypical networks, and transformer-based sequence models), leveraging pre-trained speech embeddings to enable few-shot adaptation on longitudinal real-world data. Contribution/Results: Evaluated on a large-scale dataset of over 10,000 utterances from 1,185 shift workers, all meta-learning approaches significantly outperform both conventional cross-sectional models and traditional mixed-effects models, with the transformer-based method achieving the strongest performance. Crucially, the framework eliminates the per-subject retraining inherent in mixed-effects modelling, delivering a scalable, low-latency approach to personalized voice-based health monitoring.
📝 Abstract
Speaker-dependent modelling can substantially improve performance in speech-based health monitoring applications. While mixed-effects models are commonly used for such speaker adaptation, they require computationally expensive retraining for each new observation, making them impractical in a production environment. We reformulate this task as a meta-learning problem and explore three approaches of increasing complexity: ensemble-based distance models, prototypical networks, and transformer-based sequence models. Using pre-trained speech embeddings, we evaluate these methods on a large longitudinal dataset of shift workers (N=1,185; 10,286 recordings), predicting time since sleep from speech as a proxy for fatigue, a symptom commonly associated with ill health. Our results show that all meta-learning approaches tested outperform both cross-sectional and conventional mixed-effects models, with the transformer-based method achieving the strongest performance.
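The few-shot adaptation idea behind the distance-based and prototypical approaches can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the squared-Euclidean distance metric, and the `temperature` parameter are illustrative assumptions. The point it shows is that a handful of labelled recordings from a new speaker (a support set of embedding/label pairs) personalizes the prediction at inference time, with no per-subject retraining, by softmax-weighting the support labels according to embedding similarity.

```python
import numpy as np

def prototypical_predict(support_emb, support_y, query_emb, temperature=1.0):
    """Hypothetical prototypical-style few-shot regression sketch.

    support_emb: (S, D) speech embeddings for one speaker's labelled recordings
    support_y:   (S,)   labels, e.g. hours since last sleep
    query_emb:   (Q, D) embeddings of new recordings from the same speaker
    Returns (Q,) predictions as distance-weighted averages of support labels.
    """
    # (Q, S) squared Euclidean distances between queries and support points
    d2 = ((query_emb[:, None, :] - support_emb[None, :, :]) ** 2).sum(axis=-1)
    # Softmax over negative distances: closer support recordings get more weight
    w = np.exp(-d2 / temperature)
    w /= w.sum(axis=1, keepdims=True)
    # Prediction is the weighted average of the support labels
    return w @ support_y

# Toy usage with 2-D stand-in embeddings: a query near the "2 hours awake"
# recording inherits a prediction close to that label.
support_emb = np.array([[0.0, 0.0], [10.0, 10.0]])
support_y = np.array([2.0, 16.0])
pred = prototypical_predict(support_emb, support_y, np.array([[0.1, 0.0]]))
```

Adaptation here is a single forward pass over the support set, which is what makes this family of methods low-latency compared with refitting a mixed-effects model per speaker.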