Speech Synthesis along Perceptual Voice Quality Dimensions

📅 2025-01-15
🤖 AI Summary
This study addresses precise, continuous control over perceptual voice quality (PVQ) dimensions, namely roughness, breathiness, resonance, and vocal weight, with applications in training speech pathologists and voice actors. The authors integrate a conditional continuous normalizing flow (conditional CNF) into a text-to-speech (TTS) system so that PVQ control is learned from examples rather than through direct manipulation of acoustic correlates. Phonetic experts evaluated the manipulations for both seen and unseen speakers; the results show where the system succeeds and where it still needs improvement.

📝 Abstract
While expressive speech synthesis or voice conversion systems mainly focus on controlling or manipulating abstract prosodic characteristics of speech, such as emotion or accent, we here address the control of perceptual voice qualities (PVQs) recognized by phonetic experts, which are speech properties at a lower level of abstraction. The ability to manipulate PVQs can be a valuable tool for teaching speech pathologists in training or voice actors. In this paper, we integrate a Conditional Continuous-Normalizing-Flow-based method into a Text-to-Speech system to modify perceptual voice attributes on a continuous scale. Unlike previous approaches, our system avoids direct manipulation of acoustic correlates and instead learns from examples. We demonstrate the system's capability by manipulating four voice qualities: roughness, breathiness, resonance, and weight. Phonetic experts evaluated these modifications, both for seen and unseen speaker conditions. The results highlight both the system's strengths and areas for improvement.
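To make the idea of attribute editing with a conditional continuous normalizing flow more concrete, here is a minimal toy sketch. It is not the authors' implementation. It shows one common pattern for such flows: integrate a representation forward through an ODE conditioned on the source PVQ scores, then integrate back under the target scores. Everything in it is an assumption for illustration: the 4-dimensional "voice representation", the matrix `W` standing in for a trained neural vector field, and the names `vector_field`, `integrate`, and `manipulate`. The abstract does not state which representation is manipulated or how the flow is trained.

```python
import numpy as np

# Hypothetical toy setup: a 4-dim voice representation and 4 PVQ scores
# (roughness, breathiness, resonance, weight). W is a random stand-in for
# what would be a trained neural network in a real system.
rng = np.random.default_rng(0)
DIM, N_PVQ = 4, 4
W = rng.normal(size=(DIM, N_PVQ))


def vector_field(z, t, pvq):
    """Toy conditional dynamics dz/dt = z - W @ pvq (t is unused here)."""
    return z - W @ pvq


def integrate(z, pvq, t0, t1, steps=200):
    """Fixed-step RK4 solve of dz/dt = vector_field(z, t, pvq) from t0 to t1.

    Passing t0 > t1 gives a negative step size, i.e. the inverse flow.
    """
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = vector_field(z, t, pvq)
        k2 = vector_field(z + 0.5 * h * k1, t + 0.5 * h, pvq)
        k3 = vector_field(z + 0.5 * h * k2, t + 0.5 * h, pvq)
        k4 = vector_field(z + h * k3, t + h, pvq)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return z


def manipulate(x, pvq_src, pvq_tgt):
    """Encode x under the source PVQ scores, then decode under the target."""
    latent = integrate(x, pvq_src, 0.0, 1.0)     # forward flow, t: 0 -> 1
    return integrate(latent, pvq_tgt, 1.0, 0.0)  # inverse flow, t: 1 -> 0


x = np.array([0.3, -1.0, 0.5, 2.0])    # made-up representation
src = np.array([0.2, 0.1, 0.5, 0.5])   # made-up source PVQ scores
tgt = np.array([0.2, 0.8, 0.5, 0.5])   # same voice, breathiness raised
x_edited = manipulate(x, src, tgt)
```

In this toy, decoding with the unchanged scores recovers the input up to solver error, and changing a score moves the representation by an amount proportional to the change, which is what makes the control continuous. A real system would replace the linear field with a learned network and fit it by maximum likelihood on annotated examples.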
Problem

Research questions and friction points this paper is trying to address.

Voice Quality Adjustment
Speech Acoustics
Vocal Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Continuous Normalizing Flow
Voice Synthesis
Adjustable Acoustic Features
Frederik Rautenberg
Paderborn University, Paderborn, Germany
Michael Kuhlmann
Paderborn University, Paderborn, Germany
Fritz Seebauer
Bielefeld University, Bielefeld, Germany
Jana Wiechmann
Bielefeld University, Bielefeld, Germany
Petra Wagner
Bielefeld University
prosody, speech-based interaction, multimodal communication, speech synthesis and evaluation
Reinhold Haeb-Umbach
Professor of Communications Engineering, University of Paderborn
automatic speech recognition, speech enhancement, statistical signal processing