FlowW2N: Whispered-to-Normal Speech Conversion via Flow-Matching

📅 2026-03-04

🤖 AI Summary
This work addresses the challenge of converting whispered speech to normal speech in real-world scenarios, where utterances are temporally misaligned and paired training data are unavailable. The authors propose FlowW2N, a method based on conditional flow matching that leverages only synthetic, time-aligned whisper–normal speech pairs for training. By incorporating domain-invariant high-level embeddings from an automatic speech recognition (ASR) model as conditioning signals, FlowW2N significantly enhances generalization to real whispered speech. Through systematic evaluation, the authors identify ASR embedding layers that exhibit strong cross-domain invariance and rich content information, enabling FlowW2N to achieve state-of-the-art performance on the CHAINS and wTIMIT datasets, with relative word error rate reductions of 26%–46%. The model requires only 10 inference steps and operates without any real paired training data.
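
The layer-selection idea in the summary can be illustrated with a small sketch. The page does not state the paper's actual criterion, so everything below is an assumption made for illustration: invariance is approximated by the mean cosine similarity between row-paired embeddings of synthetic and real whispers, content informativeness is taken as a precomputed per-layer score (for example, a probe accuracy), and the two are combined with a hypothetical `weight`. The function names `cosine_invariance` and `select_layer` are invented here.

```python
import numpy as np

def cosine_invariance(emb_a, emb_b):
    """Mean cosine similarity between row-paired embeddings of shape (n, d).

    Assumed proxy for cross-domain invariance; not the paper's metric.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def select_layer(invariance, content, weight=0.5):
    """Return the layer index maximizing a weighted sum of both scores.

    invariance, content: one score per ASR layer (higher is better).
    weight: hypothetical trade-off between invariance and content.
    """
    invariance = np.asarray(invariance, dtype=float)
    content = np.asarray(content, dtype=float)
    combined = weight * invariance + (1.0 - weight) * content
    return int(np.argmax(combined))
```

In this toy setup, a layer that is highly invariant but carries little content (or the reverse) loses to a layer that scores reasonably on both, which is the trade-off the summary describes.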

📝 Abstract
Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. This task is challenging due to temporal misalignment between whispered and voiced recordings and the lack of paired data. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a selection criterion optimizing content informativeness and cross-domain invariance. Our method achieves state-of-the-art intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46% relative to prior work while using only 10 steps at inference and requiring no real paired data.
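
To make the "conditional flow matching" and "10 steps at inference" parts of the abstract concrete, here is a minimal sketch of the generic technique. It is not the authors' implementation: it assumes a linear probability path between noise `x0` and target features `x1`, a fixed-step Euler solver, and a hypothetical oracle velocity field (`make_oracle_velocity`) standing in for a trained network. In an actual system the `cond` argument would carry the ASR embeddings; here it is unused.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Build one flow-matching training example on a linear path (assumed).

    x0: noise sample, x1: target normal-speech features, t in [0, 1].
    Returns the interpolated point x_t and the regression target x1 - x0.
    """
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

def euler_sample(velocity_fn, x0, cond, num_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

def make_oracle_velocity(x1):
    """Hypothetical stand-in for a trained model: the exact velocity of the
    linear path toward a known target x1 (only usable in a toy demo)."""
    def velocity_fn(x, t, cond):
        return (x1 - x) / (1.0 - t)
    return velocity_fn
```

With the oracle field, `euler_sample(make_oracle_velocity(x1), x0, cond=None)` lands on `x1` after 10 steps, because the linear path has a constant velocity along each trajectory. A learned velocity network only approximates this, so output quality with few steps depends on how well it was trained.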
Problem

Research questions and friction points this paper is trying to address.

Whispered-to-normal speech conversion
phonation reconstruction
temporal misalignment
unpaired data
speaker identity preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow matching
whispered-to-normal speech conversion
domain-invariant features
ASR embeddings
synthetic data training