🤖 AI Summary
Existing TTS research lacks large-scale expressive speech data with rich textual context and fine-grained linguistic annotations, hindering work on zero-shot expressive speech synthesis. Method: We introduce LibriQuote, a large-scale English expressive speech dataset comprising 5.3K hours of mostly expressive speech drawn from character quotations and 12.7K hours of neutral read speech. Each expressive utterance is paired with its textual context and automatically generated pseudo-labels of the speech verbs and adverbs describing the quotation. We further provide a challenging 7.5-hour test set covering a wide range of emotions and accents, and propose an evaluation paradigm for expressive synthesis that uses neutral speech as the reference, validated via both objective metrics and subjective listening tests. Contribution/Results: Fine-tuning a baseline TTS system on LibriQuote significantly improves the intelligibility of its synthesized speech; however, current TTS systems still fall short of ground-truth utterances in expressiveness and naturalness. The dataset and evaluation code are publicly released.
📝 Abstract
Text-to-speech (TTS) systems have recently achieved more expressive and natural speech synthesis by scaling to large speech datasets. However, the proportion of expressive speech in such large-scale corpora is often unclear. Moreover, existing expressive speech corpora are typically smaller in scale and primarily used for benchmarking TTS systems. In this paper, we introduce the LibriQuote dataset, an English corpus derived from read audiobooks, designed for both fine-tuning and benchmarking expressive zero-shot TTS systems. The training dataset includes 12.7K hours of read, non-expressive speech and 5.3K hours of mostly expressive speech drawn from character quotations. Each utterance in the expressive subset is supplemented with the context in which it was written, along with pseudo-labels of the speech verbs and adverbs used to describe the quotation (*e.g.* "he whispered softly"). Additionally, we provide a challenging 7.5-hour test set intended for benchmarking TTS systems: given a neutral reference speech as input, we evaluate a system's ability to synthesize an expressive utterance while preserving the reference timbre. We qualitatively validate the test set by showing that, compared to non-expressive speech, it covers a wide range of emotions as well as various accents. Extensive subjective and objective evaluations show that fine-tuning a baseline TTS system on LibriQuote significantly improves the intelligibility of its synthesized speech, and that recent systems fail to synthesize speech as expressive and natural as the ground-truth utterances. The dataset and evaluation code are freely available. Audio samples can be found at https://libriquote.github.io/.