🤖 AI Summary
Existing TTS research lacks large-scale expressive speech data with rich textual context and fine-grained linguistic annotations, hindering work on zero-shot expressive speech synthesis. Method: We introduce LibriQuote, a large-scale English expressive speech dataset comprising 5.3K hours of mostly expressive speech drawn from character quotations and 12.7K hours of neutral read speech. Each expressive utterance is paired with its textual context and automatically generated pseudo-labels of the speech verbs and adverbs describing the quotation. We further provide a challenging 7.5-hour test set covering a wide range of emotions and accents, and propose an evaluation paradigm for expressive synthesis that uses neutral speech as the reference, validated via both objective metrics and subjective listening tests. Contribution/Results: Fine-tuning a baseline TTS system on LibriQuote significantly improves the intelligibility of its synthesized speech; however, current TTS systems still fall short of ground-truth utterances in expressiveness and naturalness. The dataset and evaluation code are publicly released.
📝 Abstract
Text-to-speech (TTS) systems have recently achieved more expressive and natural speech synthesis by scaling to large speech datasets. However, the proportion of expressive speech in such large-scale corpora is often unclear. Moreover, existing expressive speech corpora are typically smaller in scale and primarily used for benchmarking TTS systems. In this paper, we introduce the LibriQuote dataset, an English corpus derived from read audiobooks, designed for both fine-tuning and benchmarking expressive zero-shot TTS systems. The training dataset includes 12.7K hours of read, non-expressive speech and 5.3K hours of mostly expressive speech drawn from character quotations. Each utterance in the expressive subset is supplemented with the context in which it was written, along with pseudo-labels of the speech verbs and adverbs used to describe the quotation (*e.g.* "he whispered softly"). Additionally, we provide a challenging 7.5-hour test set intended for benchmarking TTS systems: given a neutral reference speech as input, we evaluate a system's ability to synthesize an expressive utterance while preserving the reference timbre. We qualitatively validate the test set by showing that, compared to non-expressive speech, it covers a wide range of emotions as well as various accents. Extensive subjective and objective evaluations show that fine-tuning a baseline TTS system on LibriQuote significantly improves the intelligibility of its synthesized speech, and that recent systems fail to synthesize speech as expressive and natural as the ground-truth utterances. The dataset and evaluation code are freely available. Audio samples can be found at https://libriquote.github.io/.