LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

📅 2025-09-19
🤖 AI Summary
This paper addresses two bottlenecks in fine-grained control of voice impressions (e.g., “bright”, “calm”) for text-to-speech (TTS): impression leakage, where the synthesized voice is influenced by the speaker's reference audio rather than by the separately specified target impression, and the lack of a public, annotated corpus. It makes three contributions: (1) a training strategy that takes speaker identity from one utterance and the target impression from a different utterance of the same speaker; (2) a reference-free model that generates the speaker embedding solely from the target impression; and (3) LibriTTS-VI, the first public voice impression dataset released with clear annotation standards, built upon the LibriTTS-R corpus. With the best method, the mean squared error of 11-dimensional voice impression vectors drops from 0.61 to 0.41 in objective evaluation and from 1.15 to 0.92 in subjective evaluation, while high fidelity is maintained.

📝 Abstract
Fine-grained control over voice impressions (e.g., making a voice brighter or calmer) is a key frontier for creating more controllable text-to-speech. However, this nascent field faces two key challenges. The first is the problem of impression leakage, where the synthesized voice is undesirably influenced by the speaker's reference audio, rather than the separately specified target impression, and the second is the lack of a public, annotated corpus. To mitigate impression leakage, we propose two methods: 1) a training strategy that separately uses an utterance for speaker identity and another utterance of the same speaker for target impression, and 2) a novel reference-free model that generates a speaker embedding solely from the target impression, achieving the benefits of improved robustness against the leakage and the convenience of reference-free generation. Objective and subjective evaluations demonstrate a significant improvement in controllability. Our best method reduced the mean squared error of 11-dimensional voice impression vectors from 0.61 to 0.41 objectively and from 1.15 to 0.92 subjectively, while maintaining high fidelity. To foster reproducible research, we introduce LibriTTS-VI, the first public voice impression dataset released with clear annotation standards, built upon the LibriTTS-R corpus.
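
The reported metric, the mean squared error over 11-dimensional voice impression vectors, can be illustrated with a minimal sketch. The function name `impression_mse` and the choice to average over all utterances and all dimensions are assumptions made for illustration; the abstract does not spell out the exact averaging convention.

```python
IMPRESSION_DIMS = 11  # the abstract states that impression vectors have 11 dimensions


def impression_mse(targets, estimates):
    """Mean squared error between target and estimated impression vectors.

    Assumption: the error is averaged over every utterance and every
    dimension. `targets` and `estimates` are equal-length lists of
    11-element sequences of floats.
    """
    if not targets or len(targets) != len(estimates):
        raise ValueError("need the same non-zero number of targets and estimates")
    total = 0.0
    for target, estimate in zip(targets, estimates):
        if len(target) != IMPRESSION_DIMS or len(estimate) != IMPRESSION_DIMS:
            raise ValueError("each impression vector must have 11 dimensions")
        total += sum((t - e) ** 2 for t, e in zip(target, estimate))
    return total / (len(targets) * IMPRESSION_DIMS)
```

Under this convention, a lower value means the impression measured on the synthesized speech lies closer to the impression that was requested.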
Problem

Research questions and friction points this paper is trying to address.

Addressing impression leakage in voice synthesis control
Lack of public annotated corpus for voice impressions
Improving fine-grained controllability of voice characteristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training strategy separates speaker identity and impression
Reference-free model generates speaker embedding from impression
Public LibriTTS-VI dataset with clear annotation standards
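
The first item above, drawing speaker identity and target impression from two different utterances of the same speaker, can be sketched as a pair-sampling step. This is a hypothetical illustration, not the authors' implementation: the function name `sample_training_pairs`, the `(speaker_id, utterance_id)` input format, and the uniform random choice of the reference utterance are all assumptions.

```python
import random
from collections import defaultdict


def sample_training_pairs(utterances, seed=0):
    """Pair each utterance with a different utterance of the same speaker.

    `utterances` is a list of (speaker_id, utterance_id) tuples (a
    hypothetical format). Returns (speaker_id, speaker_ref, impression_utt)
    triples: `speaker_ref` would supply speaker identity, `impression_utt`
    would supply the target impression. Speakers with fewer than two
    utterances are skipped because no distinct pair exists.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for speaker_id, utterance_id in utterances:
        by_speaker[speaker_id].append(utterance_id)

    pairs = []
    for speaker_id, utts in by_speaker.items():
        if len(utts) < 2:
            continue
        for impression_utt in utts:
            # Pick the speaker reference from the remaining utterances so the
            # two roles never share the same audio.
            candidates = [u for u in utts if u != impression_utt]
            pairs.append((speaker_id, rng.choice(candidates), impression_utt))
    return pairs
```

Keeping the two roles on separate recordings is the point of the strategy: the speaker reference can no longer stand in for the impression signal during training.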
Authors

Junki Ohmura, Sony Group Corporation
Yuki Ito, Sony Group Corporation
Emiru Tsunoo, Sony Corp. (Speech Recognition, Music Information Retrieval)
Toshiyuki Sekiya, Sony Group Corporation
Toshiyuki Kumakura, Sony Group Corporation