The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech

📅 2025-11-06
🤖 AI Summary
This study examines how manual and automatic prosodic segmentation affect non-autoregressive speech synthesis (FastSpeech 2) of spontaneous Brazilian Portuguese, an aspect of dataset construction that remains largely unexplored. Training with explicit prosodic segmentation yields slightly more intelligible and acoustically natural speech. Automatic segmentation tends to produce more regular segments, whereas manual segmentation preserves greater variability, which contributes to more natural prosody. For neutral declarative utterances, both approaches reproduce the expected nuclear accent pattern, but the prosodic model aligns more closely with natural pre-nuclear contours. All datasets, source code, and trained models are released under the CC BY-NC-ND 4.0 license.

📝 Abstract
Spontaneous speech presents several challenges for speech synthesis, particularly in capturing the natural flow of conversation, including turn-taking, pauses, and disfluencies. Although speech synthesis systems have made significant progress in generating natural and intelligible speech, primarily through architectures that implicitly model prosodic features such as pitch, intensity, and duration, the construction of datasets with explicit prosodic segmentation and their impact on spontaneous speech synthesis remains largely unexplored. This paper evaluates the effects of manual and automatic prosodic segmentation annotations in Brazilian Portuguese on the quality of speech synthesized by a non-autoregressive model, FastSpeech 2. Experimental results show that training with prosodic segmentation produced slightly more intelligible and acoustically natural speech. While automatic segmentation tends to create more regular segments, manual prosodic segmentation introduces greater variability, which contributes to more natural prosody. Analysis of neutral declarative utterances showed that both training approaches reproduced the expected nuclear accent pattern, but the prosodic model aligned more closely with natural pre-nuclear contours. To support reproducibility and future research, all datasets, source codes, and trained models are publicly available under the CC BY-NC-ND 4.0 license.
Problem

Research questions and friction points this paper is trying to address.

Evaluating prosodic segmentation's impact on spontaneous speech synthesis quality
Comparing manual versus automatic prosodic annotations for Brazilian Portuguese synthesis
Assessing how explicit prosodic features improve naturalness in non-autoregressive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manual and automatic prosodic segmentation annotations
Non-autoregressive FastSpeech 2 model training
Prosodic segmentation slightly improves intelligibility and acoustic naturalness
Julio Cesar Galdino
University of São Paulo, São Carlos, SP, Brazil
Sidney Evaldo Leal
University of São Paulo, São Carlos, SP, Brazil
Leticia Gabriella De Souza
Universidade Estadual Paulista, São José do Rio Preto, SP, Brazil
Rodrigo de Freitas Lima
University of São Paulo, São Carlos, SP, Brazil
Antonio Nelson Fornari Mendes Moreira
University of São Paulo, São Carlos, SP, Brazil
Arnaldo Candido Junior
Computer Science Professor, São Paulo State University
Deep Learning, Natural Language Processing, Artificial Neural Networks, Machine Learning
Miguel Oliveira
Universidade Federal de Alagoas, Maceió, AL, Brazil
Edresson Casanova
Senior Research Scientist at NVIDIA
Text-to-Speech, Speech Synthesis, Speech Processing, Duplex S2S
S. Aluísio
University of São Paulo, São Carlos, SP, Brazil