📝 Abstract
Perceived voice likability plays a crucial role in various social and commercial contexts, such as partner selection and advertising. A system that provides reference likable voice samples tailored to target audiences would enable users to adjust their speaking style and voice quality, facilitating smoother communication. To this end, we propose a voice conversion method that controls the likability of input speech while preserving both speaker identity and linguistic content. To improve training data scalability, we train a likability predictor on an existing voice likability dataset and use it to automatically annotate a large speech synthesis corpus with likability ratings. Experimental evaluations reveal a significant correlation between the predictor's outputs and human-provided likability ratings. Subjective and objective evaluations further demonstrate that the proposed approach effectively controls voice likability while preserving both speaker identity and linguistic content.
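The pseudo-labeling step described above — train a likability predictor on a small rated dataset, then use it to annotate a large unlabeled corpus — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the features, ridge-regression predictor, dataset sizes, and all function names (`pool`, `predict_likability`) are hypothetical stand-ins; the paper's predictor and its training data are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def pool(frames):
    """Mean-pool frame-level acoustic features into one utterance embedding."""
    return frames.mean(axis=0)

# --- Hypothetical labeled likability dataset (stand-in for MOS-style ratings) ---
# Each utterance: (n_frames, feat_dim) features plus a scalar rating in [1, 5].
feat_dim = 8
labeled = [(rng.normal(size=(50, feat_dim)), float(rng.uniform(1, 5)))
           for _ in range(40)]

X = np.stack([pool(f) for f, _ in labeled])   # (40, feat_dim) utterance embeddings
y = np.array([r for _, r in labeled])         # (40,) human likability ratings

# Ridge-regression predictor in closed form: w = (X^T X + lam*I)^-1 X^T y
Xb = np.hstack([X, np.ones((len(X), 1))])     # append a bias column
lam = 1e-2
w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

def predict_likability(frames):
    """Predict a scalar likability score for one utterance."""
    e = np.append(pool(frames), 1.0)          # embed + bias term
    return float(e @ w)

# --- Pseudo-label a large unlabeled synthesis corpus ---
unlabeled = [rng.normal(size=(60, feat_dim)) for _ in range(5)]
pseudo_labels = [predict_likability(f) for f in unlabeled]
```

The resulting `pseudo_labels` would then serve as likability conditioning targets when training the voice conversion network, replacing costly human annotation at scale.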