🤖 AI Summary
This work addresses fine-grained prosodic control (specifically pitch, emotion, and speaking rate) in controllable text-to-speech (TTS) via textual persona descriptions. We propose the first persona-to-voice prompt rewriting framework, which uses two LLM-based rewriting strategies to transform unstructured persona texts into structured, speech-style prompts. To enhance control fidelity, we integrate prosody-disentangled acoustic modeling with tailored prompt engineering. We also systematically uncover and quantify implicit societal biases, particularly gender bias, introduced by LLMs during persona rewriting, a previously unexplored issue. Extensive experiments demonstrate significant improvements in synthesized speech naturalness, intelligibility, and style consistency. Our approach achieves state-of-the-art performance across both objective metrics and multi-dimensional subjective evaluations, including MOS, SIM, and AB tests.
📝 Abstract
In this paper, we propose a novel framework to control voice style in prompt-based, controllable text-to-speech systems by leveraging textual personas as voice style prompts. We present two persona rewriting strategies to transform generic persona descriptions into speech-oriented prompts, enabling fine-grained manipulation of prosodic attributes such as pitch, emotion, and speaking rate. Experimental results demonstrate that our methods enhance the naturalness, clarity, and consistency of synthesized speech. Finally, we analyze implicit social biases introduced by LLM-based rewriting, with a focus on gender. We underscore voice style as a crucial factor for persona-driven AI dialogue systems.