AI Summary
This work addresses high-quality whisper-to-normal speech conversion under severe parallel data scarcity, a challenge exacerbated by the degraded acoustic cues of whispered speech, which lacks vocal fold vibration and a fundamental frequency. To this end, the authors propose WhispEar, a bidirectional conversion framework that models speaking-mode-invariant information shared by whispered and normal speech through a unified semantic representation. By simultaneously training whisper-to-normal (W2N) and normal-to-whisper (N2W) models, WhispEar leverages the N2W component to zero-shot generate pseudo-parallel whispered data from abundant normal speech, thereby augmenting W2N training. This approach enables scalable pseudo-data generation for the first time, substantially mitigating data scarcity, and the work introduces the largest bilingual (Chinese-English) parallel corpus of whispered and normal speech to date. Experiments demonstrate that WhispEar significantly outperforms strong baselines, with performance improving consistently as the volume of generated data increases, confirming its effectiveness and scalability.
Abstract
Whispered speech lacks vocal fold vibration and a fundamental frequency, resulting in degraded acoustic cues that make whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot generation of pseudo-parallel whispered speech from abundant normal speech, allowing scalable data augmentation for W2N training; increasing the amount of generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
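To make the augmentation idea concrete, the sketch below shows one hedged reading of the pseudo-parallel training step: a frozen N2W model converts abundant normal speech (here, stand-in semantic features) into pseudo-whisper, and the W2N model is then supervised to reconstruct the original normal speech. All names (`ConversionModel`, `augment_and_train`) and the PyTorch setup are illustrative assumptions, not the paper's actual architecture or API.

```python
# Hedged sketch of WhispEar-style pseudo-parallel augmentation.
# `ConversionModel` and `augment_and_train` are hypothetical stand-ins,
# not the paper's implementation.
import torch
import torch.nn as nn


class ConversionModel(nn.Module):
    """Placeholder converter operating on unified semantic representations."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, semantic: torch.Tensor) -> torch.Tensor:
        return self.net(semantic)


def augment_and_train(normal_batch, n2w, w2n, optimizer, loss_fn):
    """One W2N training step on a generated pseudo-parallel pair."""
    with torch.no_grad():
        # Zero-shot N2W generation: normal speech -> pseudo-whisper.
        pseudo_whisper = n2w(normal_batch)
    # W2N conversion of the pseudo-whisper back toward normal speech.
    pred_normal = w2n(pseudo_whisper)
    # Supervise against the real normal speech that seeded the pair.
    loss = loss_fn(pred_normal, normal_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    dim = 256
    n2w, w2n = ConversionModel(dim), ConversionModel(dim)
    opt = torch.optim.Adam(w2n.parameters(), lr=1e-4)
    batch = torch.randn(8, dim)  # stand-in for normal-speech semantic features
    print(augment_and_train(batch, n2w, w2n, opt, nn.MSELoss()))
```

Because every normal-speech utterance can seed a pseudo-parallel pair this way, the W2N training set scales with the amount of available normal speech rather than with scarce recorded whisper data, which is the scalability property the paper reports.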