🤖 AI Summary
In zero-shot text-to-speech (TTS), precisely controlling paralinguistic attributes (such as brightness/darkness or warmth/coldness) through natural language descriptions, so as to shape listeners' perceptual voice impressions, remains challenging. This paper proposes the first end-to-end text-to-impression-vector generation framework. It introduces a low-dimensional, continuous embedding space of impression intensities and couples it with a large language model (LLM), enabling a direct, differentiable mapping from semantic descriptions to impression vectors that integrates seamlessly into a zero-shot TTS architecture. The method requires no reference audio or manual hyperparameter tuning and supports high-fidelity, disentangled control over more than 20 antonymic impression pairs. Comprehensive objective and subjective evaluations show significant improvements over baselines on multi-dimensional impression synthesis tasks, including a +2.1 MOS gain in subjective listening tests.
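To make the text-to-impression step concrete, here is a minimal sketch of how a natural language description could be mapped to an impression intensity vector via an LLM. The pair list, the prompt wording, the `query_llm` stub (a stand-in for a real LLM call), the JSON response format, and the [-1, 1] intensity scale are all illustrative assumptions, not the paper's actual interface.

```python
import json
import numpy as np

# Illustrative antonymic impression pairs (the paper's exact pair set is not
# given here; "dark-bright" is the example mentioned in the abstract).
IMPRESSION_PAIRS = ["dark-bright", "cold-warm", "weak-powerful", "calm-lively"]

PROMPT_TEMPLATE = (
    "Rate the desired voice along each impression pair on a scale from -1 "
    "(first adjective) to +1 (second adjective). Respond with JSON mapping "
    "each pair name to a float.\nPairs: {pairs}\nDescription: {description}"
)

def query_llm(prompt: str) -> str:
    """Stand-in for an actual LLM call; returns a canned JSON response so the
    sketch stays self-contained and runnable."""
    return json.dumps({"dark-bright": 0.8, "cold-warm": 0.6,
                       "weak-powerful": -0.2, "calm-lively": 0.4})

def description_to_impression_vector(description: str) -> np.ndarray:
    """Map a natural-language impression description to a low-dimensional
    intensity vector ordered by IMPRESSION_PAIRS."""
    prompt = PROMPT_TEMPLATE.format(pairs=", ".join(IMPRESSION_PAIRS),
                                    description=description)
    scores = json.loads(query_llm(prompt))
    vec = np.array([float(scores.get(p, 0.0)) for p in IMPRESSION_PAIRS])
    return np.clip(vec, -1.0, 1.0)  # keep intensities in the assumed range

if __name__ == "__main__":
    v = description_to_impression_vector("a bright, warm, slightly lively voice")
    print(dict(zip(IMPRESSION_PAIRS, v.round(2))))
```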
📝 Abstract
Para-/non-linguistic information in speech is pivotal in shaping listeners' impressions. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method for zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.
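The abstract does not specify how the impression vector enters the synthesis model, so the following is only one plausible conditioning scheme, sketched in PyTorch under assumed names and dimensions: the low-dimensional intensity vector is linearly projected and added to the speaker embedding consumed by the zero-shot TTS decoder. `ImpressionConditioner`, the additive fusion, and the 4-pair / 256-dim sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ImpressionConditioner(nn.Module):
    """Assumed conditioning scheme (not necessarily the paper's design):
    project the impression intensity vector into the speaker-embedding space
    and add it to the zero-shot speaker embedding."""

    def __init__(self, num_pairs: int = 4, speaker_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(num_pairs, speaker_dim)

    def forward(self, speaker_emb: torch.Tensor,
                impression_vec: torch.Tensor) -> torch.Tensor:
        # speaker_emb: (batch, speaker_dim); impression_vec: (batch, num_pairs)
        return speaker_emb + self.proj(impression_vec)

if __name__ == "__main__":
    cond = ImpressionConditioner()
    spk = torch.randn(1, 256)                    # zero-shot speaker embedding
    imp = torch.tensor([[0.8, 0.6, -0.2, 0.4]])  # intensities for 4 pairs
    print(cond(spk, imp).shape)                  # torch.Size([1, 256])
```

The additive design keeps the impression control independent of the reference-free speaker representation, which is consistent with the paper's goal of adjusting impressions without reference audio, but the actual fusion mechanism may differ.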