Prosody Analysis of Audiobooks

📅 2023-10-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current text-to-speech (TTS) systems for audiobooks suffer from inadequate modeling of expressive prosody—specifically pitch, loudness, and speaking rate—limiting naturalness and emotional engagement. Method: We propose a language model–driven prosody prediction framework, trained on 93 meticulously aligned book–audiobook text–speech pairs. This is the first systematic study to model audiobook-specific prosodic patterns, employing multi-task regression with explicit decoupling of prosodic attributes for fine-grained control. Results: On a test set of 24 audiobooks, our pitch predictions outperform commercial TTS in 22 books; loudness predictions better match human narration in 23 books. Large-scale human subjective evaluation confirms statistically significant improvements in reading naturalness (p < 0.01) and user preference. This work establishes a scalable prosody modeling paradigm for expressive TTS and releases a high-quality benchmark dataset for audiobook prosody research.
📝 Abstract
Recent advances in text-to-speech have made it possible to generate natural-sounding audio from text. However, audiobook narrations involve dramatic vocalizations and intonations by the reader, with greater reliance on emotions, dialogues, and descriptions in the narrative. Using our dataset of 93 aligned book-audiobook pairs, we present improved models for prosody prediction properties (pitch, volume, and rate of speech) from narrative text using language modeling. Our predicted prosody attributes correlate much better with human audiobook readings than results from a state-of-the-art commercial TTS system: our predicted pitch shows a higher correlation with human reading for 22 out of the 24 books, while our predicted volume attribute proves more similar to human reading for 23 out of the 24 books. Finally, we present a human evaluation study to quantify the extent that people prefer prosody-enhanced audiobook readings over commercial text-to-speech systems.
Problem

Research questions and friction points this paper is trying to address.

Text-to-Speech
Prosody
Emotion Recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advanced Algorithm
Pitch Volume Prediction
Natural-sounding Audiobooks
🔎 Similar Papers
No similar papers found.
C
Charuta G. Pethe
Department of Computer Science, Stony Brook University
Y
Yunting Yin
Department of Computer Science, Earlham College
S
S. Skiena
Department of Computer Science, Stony Brook University