🤖 AI Summary
Existing speech enhancement models generalize poorly across speakers when applied to pathological speech, such as that of Parkinson's disease patients. This paper proposes a few-shot personalized enhancement method: starting from a pretrained hybrid VAE-NMF model, it requires only a few seconds of clean speech from each speaker for fine-tuning, marking the first application of personalized fine-tuning to pathological speech enhancement. The approach substantially narrows the performance gap between pathological and neurotypical speakers, yielding a 4.2 dB higher signal-to-noise ratio (SNR) gain, and outperforms both the generic and the conventionally fine-tuned models for both speaker groups. Its core contribution is a lightweight, rapidly adaptable personalized enhancement framework that addresses the generalization bottleneck caused by high modeling bias and severe data scarcity in pathological speech.
📝 Abstract
The generalizability of speech enhancement (SE) models across speaker conditions remains largely unexplored, despite its critical importance for broader applicability. This paper investigates the performance of the hybrid variational autoencoder (VAE)-non-negative matrix factorization (NMF) model for SE, focusing primarily on its generalizability to pathological speakers with Parkinson's disease. We show that VAE models trained on large neurotypical datasets perform poorly on pathological speech. While fine-tuning these pre-trained models with pathological speech improves performance, a performance gap remains between neurotypical and pathological speakers. To address this gap, we propose using personalized SE models derived from fine-tuning pre-trained models with only a few seconds of clean data from each speaker. Our results demonstrate that personalized models considerably enhance performance for all speakers, achieving comparable results for both neurotypical and pathological speakers.
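The pipeline the abstract describes — pretrain on a large generic corpus, then personalize with only a few seconds of speaker-specific data — can be illustrated with the NMF half of the model alone. Below is a minimal numpy sketch using Euclidean multiplicative updates on a synthetic "spectrogram"; the paper's actual VAE-NMF formulation, objective, and speech data are not reproduced here, and all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-9

def nmf(V, W, n_iter=300, update_W=True):
    """Euclidean multiplicative updates (Lee-Seung style).
    V : (freq, time) non-negative "power spectrogram".
    W : (freq, rank) dictionary; frozen when update_W=False."""
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + EPS)
        if update_W:
            W *= (V @ H.T) / (W @ H @ H.T + EPS)
    return W, H

def rel_err(V, W, H):
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

F, rank = 32, 5
# Generic "neurotypical" corpus: many frames from one latent basis.
B_gen = rng.random((F, rank))
V_gen = B_gen @ rng.random((rank, 400))
# "Pathological" speaker: a perturbed basis the generic model never saw,
# observed through only a few frames (the few-shot setting).
B_spk = B_gen + 0.5 * rng.random((F, rank))
V_spk = B_spk @ rng.random((rank, 20))

# 1) Pretrain a dictionary on the generic corpus.
W_pre, _ = nmf(V_gen, rng.random((F, rank)) + EPS)

# 2a) Generic model, frozen: only activations adapt to the new speaker.
_, H_frozen = nmf(V_spk, W_pre.copy(), update_W=False)
# 2b) Personalized model: warm-start from W_pre and fine-tune W as well.
W_ft, H_ft = nmf(V_spk, W_pre.copy(), n_iter=200)

err_frozen = rel_err(V_spk, W_pre, H_frozen)
err_ft = rel_err(V_spk, W_ft, H_ft)
# Fine-tuning the dictionary on a few frames fits the new speaker better.
```

The design point this toy mirrors is the paper's: a frozen generic model hits a floor set by the mismatch between the pretrained basis and the new speaker, while a brief warm-started fine-tune closes most of that gap using very little speaker data.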