🤖 AI Summary
This study addresses the challenge of inadequate acoustic modeling of child voice characteristics in adult-to-child voice conversion (VC). It presents the first systematic evaluation of four generative modeling paradigms—diffusion models, normalizing flows, variational autoencoders (VAEs), and generative adversarial networks (GANs)—for this task. To bridge the acoustic gap between adult and child speech in fundamental frequency (F0) and formant distributions, the authors propose a lightweight frequency-domain pitch-and-formant warping post-processing technique. Evaluated on a newly constructed child dubbing corpus via combined objective and subjective metrics, all models benefit significantly from the proposed post-processing: average Mel-cepstral distortion (MCD) decreases by 38%, the mean opinion score (MOS) for naturalness improves by 2.1 points, and target speaker similarity increases markedly. This work establishes a reproducible technical pipeline and benchmark for generative VC in child speech synthesis.
📝 Abstract
Generative models are a popular choice for adult-to-adult voice conversion (VC) because they model unlabelled data efficiently. To this point, their usefulness in producing child speech, and in particular in adult-to-child VC, has not been investigated. We compare four generative models for adult-to-child VC: a diffusion model, a flow-based model, a variational autoencoder, and a generative adversarial network. Results show that although the converted speech produced by these models sounds plausible, it is insufficiently similar to the target speaker's characteristics. We introduce an efficient frequency warping technique that can be applied to the model outputs and significantly reduces the acoustic mismatch between adult and child speech. The outputs of all models are evaluated using both objective and subjective measures; in particular, we compare specific speaker pairings using a unique corpus collected for dubbing of child speech.
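As a minimal illustration of the kind of frequency-axis warping described above (not the paper's actual implementation), the sketch below stretches the frequency axis of a single magnitude-spectrum frame by a factor `alpha` using linear interpolation, which shifts spectral peaks (e.g. formants) upward toward more child-like values. The function name, interface, and warping factor are assumptions for illustration only.

```python
import numpy as np

def warp_spectrum(mag_frame: np.ndarray, alpha: float) -> np.ndarray:
    """Warp the frequency axis of one magnitude-spectrum frame by factor alpha.

    alpha > 1 moves spectral energy (e.g. formant peaks) to higher
    frequency bins; alpha < 1 moves it lower. Output bin k takes the
    interpolated value at input position k / alpha, so a peak at bin b
    lands near bin b * alpha. Bins mapped from beyond the input range
    are filled with zeros. Purely illustrative sketch.
    """
    n = len(mag_frame)
    src_bins = np.arange(n) / alpha  # where each output bin reads from
    return np.interp(src_bins, np.arange(n), mag_frame, left=0.0, right=0.0)

# Example: a lone spectral peak at bin 10 moves to bin 20 under alpha = 2.
frame = np.zeros(64)
frame[10] = 1.0
warped = warp_spectrum(frame, 2.0)
# np.argmax(warped) → 20
```

A full pipeline would warp every frame of the converted spectrogram (and shift F0 separately) before resynthesis; this sketch only shows the per-frame frequency remapping.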