🤖 AI Summary
This study addresses the largely unexplored role of explicit prosodic segmentation in spontaneous speech synthesis. It compares manual and automatic prosodic segmentation annotations in Brazilian Portuguese as training input for a non-autoregressive model, FastSpeech 2, and evaluates their effect on the quality of the synthesized speech. Training with prosodic segmentation yields slightly more intelligible and acoustically natural speech. Automatic segmentation tends to produce more regular segments, whereas manual segmentation introduces greater variability, which contributes to more natural prosody. In neutral declarative utterances, both training approaches reproduce the expected nuclear accent pattern, but the prosodic model aligns more closely with natural pre-nuclear contours. All datasets, source code, and trained models are released under the CC BY-NC-ND 4.0 license, providing empirical evidence and a reproducible basis for fine-grained prosodic modeling in spontaneous speech synthesis.
📝 Abstract
Spontaneous speech presents several challenges for speech synthesis, particularly in capturing the natural flow of conversation, including turn-taking, pauses, and disfluencies. Speech synthesis systems have made significant progress in generating natural and intelligible speech, primarily through architectures that implicitly model prosodic features such as pitch, intensity, and duration. However, the construction of datasets with explicit prosodic segmentation and the impact of such datasets on spontaneous speech synthesis remain largely unexplored. This paper evaluates the effects of manual and automatic prosodic segmentation annotations in Brazilian Portuguese on the quality of speech synthesized by a non-autoregressive model, FastSpeech 2. Experimental results show that training with prosodic segmentation produced slightly more intelligible and acoustically natural speech. While automatic segmentation tends to create more regular segments, manual prosodic segmentation introduces greater variability, which contributes to more natural prosody. Analysis of neutral declarative utterances showed that both training approaches reproduced the expected nuclear accent pattern, but the prosodic model aligned more closely with natural pre-nuclear contours. To support reproducibility and future research, all datasets, source code, and trained models are publicly available under the CC BY-NC-ND 4.0 license.