Speech Synthesis along Perceptual Voice Quality Dimensions

📅 2025-01-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses precise, continuous control over perceptual voice quality (PVQ) dimensions, such as roughness, breathiness, resonance, and vocal weight, as a tool for training speech pathologists and voice actors. It integrates a conditional continuous normalizing flow (conditional CNF) into a text-to-speech (TTS) system so that PVQ attributes can be modified on a continuous scale, learning from examples rather than directly manipulating acoustic correlates. Phonetic experts evaluated the modifications for both seen and unseen speakers; the results highlight the system's strengths as well as areas for improvement.

📝 Abstract

While expressive speech synthesis or voice conversion systems mainly focus on controlling or manipulating abstract prosodic characteristics of speech, such as emotion or accent, we here address the control of perceptual voice qualities (PVQs) recognized by phonetic experts, which are speech properties at a lower level of abstraction. The ability to manipulate PVQs can be a valuable tool for teaching speech pathologists in training or voice actors. In this paper, we integrate a Conditional Continuous-Normalizing-Flow-based method into a Text-to-Speech system to modify perceptual voice attributes on a continuous scale. Unlike previous approaches, our system avoids direct manipulation of acoustic correlates and instead learns from examples. We demonstrate the system's capability by manipulating four voice qualities: roughness, breathiness, resonance and weight. Phonetic experts evaluated these modifications, both for seen and unseen speaker conditions. The results highlight both the system's strengths and areas for improvement.
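
To illustrate the general idea of attribute manipulation with a conditional continuous normalizing flow, the following is a minimal toy sketch, not the paper's implementation. A conditional CNF defines an ordinary differential equation dz/dt = f(z, c) whose vector field depends on a condition c; integrating backward under the source condition and forward under a target condition moves a representation along the conditioned attribute. Everything here is an assumption made for illustration: the 2-D embedding, the two-element condition vector standing in for PVQ scores, the fixed weights `W_Z` and `W_C`, the time-independent tanh vector field, and the function names. A real system would learn the vector field from annotated speech.

```python
import math

# Hypothetical fixed weights for a 2-D toy example; a real system would
# learn a neural vector field from annotated speech.
W_Z = [[0.5, -0.3], [0.2, 0.4]]   # acts on the embedding z
W_C = [[1.0, 0.0], [0.0, 1.0]]    # acts on the condition c (stand-in PVQ scores)


def velocity(z, c):
    """Toy conditional vector field f(z, c) = tanh(W_Z z + W_C c)."""
    return [
        math.tanh(
            sum(W_Z[i][j] * z[j] for j in range(2))
            + sum(W_C[i][k] * c[k] for k in range(2))
        )
        for i in range(2)
    ]


def integrate(z, c, t0, t1, steps=200):
    """Solve dz/dt = velocity(z, c) from t0 to t1 with classical RK4."""
    h = (t1 - t0) / steps  # negative when integrating backward in time
    for _ in range(steps):
        k1 = velocity(z, c)
        k2 = velocity([zi + 0.5 * h * ki for zi, ki in zip(z, k1)], c)
        k3 = velocity([zi + 0.5 * h * ki for zi, ki in zip(z, k2)], c)
        k4 = velocity([zi + h * ki for zi, ki in zip(z, k3)], c)
        z = [
            zi + h / 6.0 * (a + 2.0 * b + 2.0 * d + e)
            for zi, a, b, d, e in zip(z, k1, k2, k3, k4)
        ]
    return z


def manipulate(x, c_src, c_tgt):
    """Map x to the latent space under c_src, then back under c_tgt."""
    latent = integrate(x, c_src, 1.0, 0.0)     # data -> latent (backward)
    return integrate(latent, c_tgt, 0.0, 1.0)  # latent -> data (forward)


x = [0.3, -0.1]  # hypothetical embedding
unchanged = manipulate(x, [0.0, 0.0], [0.0, 0.0])  # same condition: round trip
shifted = manipulate(x, [0.0, 0.0], [1.0, 0.0])    # raise the first attribute
```

With the same source and target condition the round trip returns the input up to numerical error, while changing one entry of the target condition moves the embedding. Because the condition is a real-valued vector, intermediate values such as 0.25 or 0.5 give correspondingly graded shifts, which is the sense in which control is continuous.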

Problem

Research questions and friction points this paper is trying to address.

Voice Quality Adjustment

Speech Acoustics

Vocal Training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Continuous Normalizing Flow

Voice Synthesis

Adjustable Acoustic Features

🔎 Similar Papers

A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection