WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation

📅 2026-03-09
🤖 AI Summary
This work addresses high-quality whisper-to-normal speech conversion under severe parallel-data scarcity, a challenge compounded by the degraded acoustic cues of whispered speech, which lacks vocal fold vibration and fundamental frequency. The authors propose WhispEar, a bidirectional conversion framework that models both whispered and normal speech through a unified semantic representation capturing speaking-mode-invariant information. By jointly training whisper-to-normal (W2N) and normal-to-whisper (N2W) models, WhispEar uses the N2W component to generate pseudo-parallel whispered data zero-shot from abundant normal speech, thereby augmenting W2N training. This enables scalable pseudo-data generation for the first time, substantially mitigating data scarcity. The authors also release the largest bilingual (Chinese–English) parallel corpus of whispered and normal speech to date. Experiments show that WhispEar significantly outperforms strong baselines, with performance improving consistently as the volume of generated data grows, confirming its effectiveness and scalability.

๐Ÿ“ Abstract
Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
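The augmentation loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration of the data flow only, with hypothetical function and field names (`n2w_convert`, `build_pseudo_parallel`, `augment_w2n_training`, and the `mode`/`f0` keys are all assumptions, not the paper's API): in the real system, N2W is a trained neural converter that synthesizes whisper acoustics from unified semantic representations, not a dictionary rewrite.

```python
# Sketch of WhispEar's pseudo-parallel augmentation, assuming a trained
# N2W converter. All names here are illustrative placeholders.

def n2w_convert(normal_utt):
    """Stand-in for the trained N2W model: maps a normal-speech utterance
    to a pseudo-whispered version. The real model synthesizes whisper
    acoustics (no vocal fold vibration, no F0); here we only relabel the
    speaking mode and drop the pitch track to mark that absence."""
    return {"features": normal_utt["features"], "mode": "whisper", "f0": None}

def build_pseudo_parallel(normal_corpus):
    """Zero-shot generation of (pseudo-whisper, normal) pairs from
    abundant unpaired normal speech."""
    return [(n2w_convert(utt), utt) for utt in normal_corpus]

def augment_w2n_training(recorded_pairs, normal_corpus):
    """W2N training set = scarce recorded parallel pairs plus scalable
    pseudo-parallel pairs; the paper reports that W2N quality keeps
    improving as the pseudo-parallel portion grows."""
    return recorded_pairs + build_pseudo_parallel(normal_corpus)
```

Because `build_pseudo_parallel` needs only unpaired normal speech, the size of the W2N training set is bounded by available normal-speech data rather than by recorded whisper-normal pairs, which is the scalability claim of the paper.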
Problem

Research questions and friction points this paper is trying to address.

whispered speech
speech conversion
parallel data
acoustic cues
fundamental frequency
Innovation

Methods, ideas, or system contributions that make the work stand out.

whispered speech conversion
pseudo-parallel data generation
bidirectional framework
zero-shot synthesis
speech representation
Zihao Fang
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Yingda Shen
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Zifan Guan
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Tongtong Song
Honor Device Co., Ltd, China
Zhenyi Liu
Texas Tech University
Zhizheng Wu
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Mel Lab
Spoken Language Processing · DeepFake Detection · Music Processing