🤖 AI Summary
This work addresses personalized speech synthesis and editing for individuals with speech impairments who have no recordings of their original voice. We propose a human-in-the-loop latent-space navigation framework: (1) a neural analysis-and-synthesis model provides a latent speaker embedding space; (2) users complete simple listening-and-comparison tasks, and their feedback iteratively steers a search through that space toward the target voice; and (3) an analysis of the mel-spectrogram generator's Jacobians identifies editing directions for six voice attributes (pitch level, pitch range, volume, vocal tension, nasality, and tone color), which users can adjust to fine-tune the result. In both computer simulations and real-user studies, the approach effectively approximates the target voices.
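As a rough illustration of the feedback loop in step (2), the sketch below runs a pairwise-comparison search in a toy embedding space. It is not the paper's search algorithm: the random-perturbation strategy, the `simulated_listener` function (a stand-in for a human who always prefers the option closer to a hidden target), and all parameter values are assumptions chosen only to show how comparison feedback alone can steer a latent vector.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simulated_listener(target, option_a, option_b):
    """Hypothetical stand-in for a human: prefers the option nearer the hidden target."""
    if distance(option_a, target) <= distance(option_b, target):
        return option_a
    return option_b


def preference_search(target, dim=8, rounds=100, step=0.5, decay=0.98, seed=0):
    """Refine an embedding using only 'which of these two sounds closer?' answers."""
    rng = random.Random(seed)
    current = [0.0] * dim
    history = [distance(current, target)]  # tracked for inspection only
    for _ in range(rounds):
        # Propose a candidate by stepping along a random unit direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        candidate = [c + step * d / norm for c, d in zip(current, direction)]
        # The listener compares the current voice with the candidate.
        current = simulated_listener(target, current, candidate)
        history.append(distance(current, target))
        step *= decay  # shrink proposals as the search settles
    return current, history


target_rng = random.Random(1)
hidden_target = [target_rng.uniform(-1.0, 1.0) for _ in range(8)]
best, history = preference_search(hidden_target)
print(f"distance to target: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because the listener never accepts a worse option, the distance to the target cannot increase from one round to the next; a real user's answers would be noisier, which is one reason a practical system would need a more robust search strategy than this one.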
📝 Abstract
This paper presents a user-driven approach for synthesizing specific target voices based on user feedback rather than reference recordings, which is particularly beneficial for speech-impaired individuals who want to recreate their lost voices but lack prior recordings. Our method leverages the neural analysis and synthesis framework to construct a latent speaker embedding space. Within this latent space, a human-in-the-loop search algorithm guides the voice generation process. Users participate in a series of straightforward listening-and-comparison tasks, providing feedback that iteratively refines the synthesized voice to match their desired target. Both computer simulations and real-world user studies demonstrate that the proposed approach can effectively approximate target voices. Moreover, by analyzing the mel-spectrogram generator's Jacobians, we identify a set of meaningful voice editing directions within the latent space. These directions enable users to further fine-tune specific attributes of the generated voice, including the pitch level, pitch range, volume, vocal tension, nasality, and tone color.
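One way to picture the Jacobian analysis is sketched below, under stated assumptions. The abstract does not say how directions are extracted from the Jacobians; here we assume the common choice of taking the leading right singular vector, computed by power iteration on J^T J. The `toy_generator`, its weight matrix `W`, and the `edit` helper are hypothetical stand-ins for the real mel-spectrogram generator and editing interface.

```python
import math

# Hypothetical weights for a toy "generator" mapping a 3-D speaker embedding
# to a 3-D output vector (a stand-in for a mel-spectrogram).
W = [[3.0, 0.1, 0.0],
     [0.0, 0.5, 0.0],
     [0.1, 0.0, 0.2]]


def toy_generator(z):
    """Toy nonlinear generator: tanh(W z)."""
    return [math.tanh(sum(w * x for w, x in zip(row, z))) for row in W]


def jacobian(f, z, eps=1e-4):
    """Central finite-difference Jacobian, J[i][j] = d f_i / d z_j."""
    columns = []
    for j in range(len(z)):
        plus, minus = list(z), list(z)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        columns.append([(a - b) / (2 * eps) for a, b in zip(f_plus, f_minus)])
    return [list(row) for row in zip(*columns)]  # transpose columns into rows


def top_direction(J, iters=50):
    """Leading right singular vector of J via power iteration on J^T J."""
    n = len(J[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Jv = [sum(a * b for a, b in zip(row, v)) for row in J]  # J v
        w = [sum(J[i][j] * Jv[i] for i in range(len(J))) for j in range(n)]  # J^T (J v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def edit(z, direction, alpha):
    """Move the embedding a distance alpha along an editing direction."""
    return [a + alpha * d for a, d in zip(z, direction)]


z0 = [0.0, 0.0, 0.0]
J = jacobian(toy_generator, z0)
direction = top_direction(J)
edited = edit(z0, direction, 0.1)
print("direction:", [round(d, 3) for d in direction])
```

In this toy setup the first latent coordinate dominates the output, so the recovered direction lies almost entirely along it. In a real system, each candidate direction would still need to be auditioned to decide which perceptual attribute, if any, it controls.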