Voice Impression Control in Zero-Shot TTS

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In zero-shot text-to-speech (TTS), precisely controlling paralinguistic attributes—such as brightness/darkness or warmth/coldness—via natural language descriptions to shape listeners’ perceptual voice impressions remains challenging. This paper proposes the first end-to-end text-to-impression vector generation framework. It introduces a low-dimensional, continuous impression intensity embedding space and couples a large language model (LLM) to enable differentiable, direct mapping from semantic descriptions to impression vectors, seamlessly integrated into a zero-shot TTS architecture. The method requires no reference audio or manual hyperparameter tuning, and supports high-fidelity, disentangled control over 20+ semantic antonymic impression pairs. Comprehensive objective and subjective evaluations demonstrate significant improvements over baselines across multi-dimensional impression synthesis tasks, with a +2.1 MOS gain in subjective listening tests.

Technology Category

Application Category

📝 Abstract
Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.
Problem

Research questions and friction points this paper is trying to address.

Control voice impressions in zero-shot TTS
Modulate para-/non-linguistic speech information
Generate impression vectors from natural language
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-dimensional vector controls voice impressions
LLM generates impression vectors from descriptions
Objective and subjective evaluations confirm effectiveness
🔎 Similar Papers
No similar papers found.