🤖 AI Summary
In zero-shot text-to-speech (TTS), precisely controlling paralinguistic attributes (such as brightness/darkness or warmth/coldness) through natural language descriptions, so as to shape listeners' perceptual voice impressions, remains challenging. This paper proposes the first end-to-end text-to-impression-vector generation framework. It introduces a low-dimensional, continuous embedding space of impression intensities and couples it with a large language model (LLM), enabling a direct, differentiable mapping from semantic descriptions to impression vectors that integrates seamlessly into a zero-shot TTS architecture. The method requires no reference audio or manual hyperparameter tuning and supports high-fidelity, disentangled control over more than 20 antonymic impression pairs. Comprehensive objective and subjective evaluations show significant improvements over baselines on multi-dimensional impression synthesis tasks, including a +2.1 MOS gain in subjective listening tests.
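To make the text-to-impression step concrete, here is a minimal sketch of how a natural language description could be mapped to an impression intensity vector via an LLM. The pair list, the prompt wording, the `query_llm` stub (a stand-in for a real LLM call), the JSON response format, and the [-1, 1] intensity scale are all illustrative assumptions, not the paper's actual interface.

```python
import json
import numpy as np

# Illustrative antonymic impression pairs (the paper's exact pair set is not
# given here; "dark-bright" is the example mentioned in the abstract).
IMPRESSION_PAIRS = ["dark-bright", "cold-warm", "weak-powerful", "calm-lively"]

PROMPT_TEMPLATE = (
    "Rate the desired voice along each impression pair on a scale from -1 "
    "(first adjective) to +1 (second adjective). Respond with JSON mapping "
    "each pair name to a float.\nPairs: {pairs}\nDescription: {description}"
)

def query_llm(prompt: str) -> str:
    """Stand-in for an actual LLM call; returns a canned JSON response so the
    sketch stays self-contained and runnable."""
    return json.dumps({"dark-bright": 0.8, "cold-warm": 0.6,
                       "weak-powerful": -0.2, "calm-lively": 0.4})

def description_to_impression_vector(description: str) -> np.ndarray:
    """Map a natural-language impression description to a low-dimensional
    intensity vector ordered by IMPRESSION_PAIRS."""
    prompt = PROMPT_TEMPLATE.format(pairs=", ".join(IMPRESSION_PAIRS),
                                    description=description)
    scores = json.loads(query_llm(prompt))
    vec = np.array([float(scores.get(p, 0.0)) for p in IMPRESSION_PAIRS])
    return np.clip(vec, -1.0, 1.0)  # keep intensities in the assumed range

if __name__ == "__main__":
    v = description_to_impression_vector("a bright, warm, slightly lively voice")
    print(dict(zip(IMPRESSION_PAIRS, v.round(2))))
```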
📝 Abstract
Para-/non-linguistic information in speech is pivotal in shaping listeners' impressions. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method for zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.
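The abstract does not specify how the impression vector enters the synthesis model, so the following is only one plausible conditioning scheme, sketched in PyTorch under assumed names and dimensions: the low-dimensional intensity vector is linearly projected and added to the speaker embedding consumed by the zero-shot TTS decoder. `ImpressionConditioner`, the additive fusion, and the 4-pair / 256-dim sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ImpressionConditioner(nn.Module):
    """Assumed conditioning scheme (not necessarily the paper's design):
    project the impression intensity vector into the speaker-embedding space
    and add it to the zero-shot speaker embedding."""

    def __init__(self, num_pairs: int = 4, speaker_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(num_pairs, speaker_dim)

    def forward(self, speaker_emb: torch.Tensor,
                impression_vec: torch.Tensor) -> torch.Tensor:
        # speaker_emb: (batch, speaker_dim); impression_vec: (batch, num_pairs)
        return speaker_emb + self.proj(impression_vec)

if __name__ == "__main__":
    cond = ImpressionConditioner()
    spk = torch.randn(1, 256)                    # zero-shot speaker embedding
    imp = torch.tensor([[0.8, 0.6, -0.2, 0.4]])  # intensities for 4 pairs
    print(cond(spk, imp).shape)                  # torch.Size([1, 256])
```

The additive design keeps the impression control independent of the reference-free speaker representation, which is consistent with the paper's goal of adjusting impressions without reference audio, but the actual fusion mechanism may differ.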