User-Driven Voice Generation and Editing through Latent Space Navigation

📅 2024-08-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses personalized speech synthesis and editing for individuals with speech impairments who have no recordings of their original voice. It proposes a human-in-the-loop latent-space navigation framework: (1) a neural analysis and synthesis model provides a latent speaker embedding space; (2) users complete listening-and-comparison tasks, and their feedback iteratively refines the synthesized voice toward the desired target; and (3) an analysis of the mel-spectrogram generator's Jacobians yields editing directions in the latent space for six voice attributes (pitch level, pitch range, volume, vocal tension, nasality, and tone color), which users can adjust to fine-tune the result. Computer simulations and real-world user studies indicate that the approach can effectively approximate target voices.
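The feedback loop in step (2) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's search algorithm: the latent space is a plain 8-dimensional vector space, the user is replaced by a hypothetical `simulated_listener` that always picks the candidate nearest a hidden target, and candidates are proposed along random directions. The names `comparison_search`, `step`, and `rounds` are illustrative only.

```python
import numpy as np

def simulated_listener(candidates, target):
    """Stand-in for a human listener (assumption): returns the index of the
    candidate embedding closest to the hidden target embedding."""
    dists = np.linalg.norm(candidates - target, axis=1)
    return int(np.argmin(dists))

def comparison_search(target, dim=8, rounds=200, step=0.5, seed=0):
    """Refine a latent vector using only 'which sounds closer?' answers."""
    rng = np.random.default_rng(seed)
    current = np.zeros(dim)  # start from the origin of the toy latent space
    history = [float(np.linalg.norm(current - target))]
    for _ in range(rounds):
        # Propose two alternatives along a random unit direction.
        direction = rng.standard_normal(dim)
        direction /= np.linalg.norm(direction)
        candidates = np.stack([
            current,
            current + step * direction,
            current - step * direction,
        ])
        # The listener only compares; the target itself is never revealed.
        choice = simulated_listener(candidates, target)
        current = candidates[choice]
        history.append(float(np.linalg.norm(current - target)))
    return current, history

rng = np.random.default_rng(1)
hidden_target = rng.standard_normal(8)
found, history = comparison_search(hidden_target)
print(f"distance to target: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because the current voice is always one of the options, the distance to the target never increases in this toy setting. A real listener is noisy and perceives similarity in ways a Euclidean distance does not capture, so the sketch shows only the shape of the loop.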

📝 Abstract
This paper presents a user-driven approach for synthesizing specific target voices based on user feedback rather than reference recordings, which is particularly beneficial for speech-impaired individuals who want to recreate their lost voices but lack prior recordings. Our method leverages the neural analysis and synthesis framework to construct a latent speaker embedding space. Within this latent space, a human-in-the-loop search algorithm guides the voice generation process. Users participate in a series of straightforward listening-and-comparison tasks, providing feedback that iteratively refines the synthesized voice to match their desired target. Both computer simulations and real-world user studies demonstrate that the proposed approach can effectively approximate target voices. Moreover, by analyzing the mel-spectrogram generator's Jacobians, we identify a set of meaningful voice editing directions within the latent space. These directions enable users to further fine-tune specific attributes of the generated voice, including the pitch level, pitch range, volume, vocal tension, nasality, and tone color.
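One common way to turn a generator's Jacobian into editing directions is to take its singular value decomposition and use the right-singular vectors as candidate directions. The sketch below illustrates that idea on a toy model. It is an assumption for illustration, not the procedure the paper confirms: `toy_generator` is a hypothetical stand-in for the mel-spectrogram generator, the Jacobian is estimated by finite differences, and nothing here assigns a direction to a named attribute such as pitch or nasality.

```python
import numpy as np

def toy_generator(z, W):
    """Hypothetical stand-in for a mel-spectrogram generator: maps a latent
    vector z to a flattened 'spectrogram' through a fixed nonlinear map."""
    return np.tanh(W @ z)

def numerical_jacobian(f, z, eps=1e-5):
    """Central-difference estimate of d f(z) / d z, shape (out_dim, latent_dim)."""
    out0 = f(z)
    J = np.zeros((out0.size, z.size))
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        J[:, i] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
W = rng.standard_normal((40, 6))    # toy weights: 6-dim latent, 40-dim output
z0 = 0.1 * rng.standard_normal(6)   # latent point around which to analyze

def gen(z):
    return toy_generator(z, W)

J = numerical_jacobian(gen, z0)

# Rows of Vt are orthonormal latent directions, ordered by how strongly a
# small step along each one changes the generator output near z0.
_, singular_values, Vt = np.linalg.svd(J, full_matrices=False)
directions = Vt

def edit(z, direction, amount):
    """Move the latent vector along one candidate editing direction."""
    return z + amount * direction

strong = np.linalg.norm(gen(edit(z0, directions[0], 0.01)) - gen(z0))
weak = np.linalg.norm(gen(edit(z0, directions[-1], 0.01)) - gen(z0))
print(f"output change along first direction: {strong:.5f}, last: {weak:.5f}")
```

In a real system, each candidate direction would still need to be auditioned and labeled before it could be offered to users as a control for a specific attribute.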
Problem

Research questions and friction points this paper is trying to address.

Speech Synthesis
Speech Impairment
Customizable Voices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Analysis and Synthesis
Voice Customization Space
Feedback-driven Optimization