Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks

📅 2023-06-10

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 1

🤖 AI Summary

This work addresses whisper-to-normal speech conversion with non-parallel training data. It proposes a single model that performs both acoustic feature conversion and waveform synthesis, eliminating the need for a separate vocoder. The model is trained with masked cycle-consistent generative adversarial training and a self-supervised auxiliary task, and it generates raw audio waveforms directly. In subjective listening tests, the method outperforms the baseline in whispered speech conversion by up to 6.7% (relative improvement). In conventional voice conversion, mean opinion score predictions are competitive, with relative improvements between 0.5% and 2.4%. The main contributions are (1) a unified, vocoder-free model for conversion and synthesis that trains without parallel data, and (2) the combination of cycle-consistent training with a self-supervised auxiliary task to produce high-quality converted waveforms.

📝 Abstract

Cycle-consistent generative adversarial networks have been widely used in non-parallel voice conversion (VC). Their ability to learn mappings between source and target features without relying on parallel training data eliminates the need for temporal alignments. However, most methods decouple the conversion of acoustic features from synthesizing the audio signal by using separate models for conversion and waveform synthesis. This work unifies conversion and synthesis into a single model, thereby eliminating the need for a separate vocoder. By leveraging cycle-consistent training and a self-supervised auxiliary training task, our model is able to efficiently generate converted high-quality raw audio waveforms. Subjective listening tests show that our method outperforms the baseline in whispered speech conversion (up to 6.7% relative improvement), and mean opinion score predictions yield competitive results in conventional VC (between 0.5% and 2.4% relative improvement).
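The cycle-consistent training mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the L1 reconstruction distance, the frame-level zero masking, and the toy linear "generators" are all assumptions chosen for illustration. The idea shown is that a source input is (optionally) masked, converted to the target domain, converted back, and compared against the original unmasked input.

```python
import numpy as np

def apply_frame_mask(x, mask):
    """Zero out masked frames (assumed masking scheme).
    x: (frames, features); mask: (frames,) with 1 = keep, 0 = masked."""
    return x * mask[:, None]

def cycle_consistency_loss(x, g_forward, g_backward, mask=None):
    """L1 distance between x and its round-trip reconstruction.
    g_forward / g_backward are hypothetical stand-ins for the two generators."""
    x_in = x if mask is None else apply_frame_mask(x, mask)
    converted = g_forward(x_in)            # source -> target domain
    reconstructed = g_backward(converted)  # target -> back to source domain
    return float(np.mean(np.abs(reconstructed - x)))

# Toy generators that are exact inverses of each other (illustrative only).
g_forward = lambda x: 2.0 * x
g_backward = lambda y: 0.5 * y

x = np.ones((4, 3))                   # 4 frames, 3 features per frame
mask = np.array([1.0, 0.0, 1.0, 1.0])  # hide the second frame

loss_plain = cycle_consistency_loss(x, g_forward, g_backward)
loss_masked = cycle_consistency_loss(x, g_forward, g_backward, mask)
```

With these toy generators, the unmasked round trip is perfect, while the masked one is penalized for the hidden frame, since the toy generators cannot fill it in. In a trained model, this penalty is what would push the generators to infer missing content from context, which is one plausible reading of a self-supervised auxiliary task.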

Problem

Research questions and friction points this paper is trying to address.

Eliminates need for separate vocoder in speech conversion

Unifies feature conversion and waveform synthesis in one model

Improves whispered speech conversion over the baseline by up to 6.7% (relative, in subjective listening tests)

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model for conversion and synthesis

Vocoder-free with masked cycle-consistent GANs

Self-supervised auxiliary training task used alongside cycle-consistent training
