🤖 AI Summary
This study addresses continuous control over perceptual voice quality (PVQ) dimensions, such as roughness, breathiness, resonance, and weight, with applications in training speech pathologists and voice actors. It integrates a conditional continuous normalizing flow (CNF) into a text-to-speech (TTS) system to modify PVQs on a continuous scale. Rather than directly manipulating acoustic correlates, the system learns from examples. Phonetic experts evaluated the manipulations for both seen and unseen speakers, and the results highlight both the system's strengths and areas for improvement.
📝 Abstract
While expressive speech synthesis and voice conversion systems mainly focus on controlling or manipulating abstract prosodic characteristics of speech, such as emotion or accent, we address the control of perceptual voice qualities (PVQs) recognized by phonetic experts, which are speech properties at a lower level of abstraction. The ability to manipulate PVQs can be a valuable tool for teaching speech pathologists in training and voice actors. In this paper, we integrate a method based on conditional continuous normalizing flows into a text-to-speech system to modify perceptual voice attributes on a continuous scale. Unlike previous approaches, our system avoids direct manipulation of acoustic correlates and instead learns from examples. We demonstrate the system's capability by manipulating four voice qualities: roughness, breathiness, resonance, and weight. Phonetic experts evaluated these modifications for both seen and unseen speakers. The results highlight both the system's strengths and areas for improvement.
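To make the idea of attribute manipulation with a conditional continuous normalizing flow more concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes a common flow-based editing recipe: map a speaker-like embedding to a latent under its source attribute values, then map it back under the target values. The vector field `-z + V @ cond`, the matrix `V`, the embedding size, and all function names are hypothetical stand-ins; a trained neural network would take the place of the hand-written dynamics.

```python
import math

def vector_field(z, cond, V):
    # Hypothetical conditional dynamics dz/dt = -z + V @ cond.
    # In a real system this would be a trained network, not a fixed linear map.
    drive = [sum(v * c for v, c in zip(row, cond)) for row in V]
    return [d - zi for zi, d in zip(z, drive)]

def integrate(z, cond, V, t0, t1, steps=200):
    # Classic fourth-order Runge-Kutta integration from t0 to t1.
    # Choosing t1 < t0 runs the same flow backwards, which is what makes it invertible.
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = vector_field(z, cond, V)
        k2 = vector_field([zi + 0.5 * h * k for zi, k in zip(z, k1)], cond, V)
        k3 = vector_field([zi + 0.5 * h * k for zi, k in zip(z, k2)], cond, V)
        k4 = vector_field([zi + h * k for zi, k in zip(z, k3)], cond, V)
        z = [zi + h / 6.0 * (a + 2 * b + 2 * c + d)
             for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]
    return z

def manipulate(x, cond_src, cond_tgt, V):
    # Encode: data -> latent under the source attribute values.
    z = integrate(x, cond_src, V, 1.0, 0.0)
    # Decode: latent -> data under the target attribute values.
    return integrate(z, cond_tgt, V, 0.0, 1.0)

# Toy setup: a 3-dimensional embedding and 2 attributes (say, roughness and breathiness).
V = [[0.5, 0.0], [0.0, -0.3], [0.2, 0.1]]
x = [0.1, -0.4, 0.7]
cond_src = [0.2, 0.5]

# Sweep the first attribute on a continuous scale while holding the second fixed.
for level in (0.2, 0.4, 0.6, 0.8, 1.0):
    edited = manipulate(x, cond_src, [level, 0.5], V)
    print(level, [round(v, 4) for v in edited])
```

In this toy case the edited embedding moves smoothly and in proportion to the change in the conditioning value, and leaving the condition unchanged returns the original embedding. Whether a learned flow behaves this regularly depends on the training data and model, which this sketch does not address.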