Synthetic Data Generation for Phrase Break Prediction with Large Language Model

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Phrase break prediction heavily relies on costly manual annotations and suffers from unstable data quality due to speech coarticulation and phonetic variation. Method: This paper introduces, for the first time, a large language model (LLM)-based synthetic data generation approach that automatically produces multilingual text corpora with precise phrase-boundary annotations, bypassing conventional speech annotation pipelines. A multilingual comparative evaluation framework is proposed to rigorously assess the efficacy of LLM-synthesized data for phrase break prediction. Contribution/Results: Experiments across multiple languages demonstrate that models trained exclusively on LLM-generated data achieve performance comparable to those trained on human-annotated ground truth, significantly alleviating the data scarcity bottleneck. This work establishes a scalable, low-cost, high-quality data provisioning paradigm for low-resource text-to-speech and prosody modeling tasks.

📝 Abstract
Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but rely heavily on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLMs to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlight the potential of LLMs as a viable solution for the speech domain.
Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on costly human annotations for phrase breaks
Addressing variability in speech data for consistent predictions
Exploring LLMs to generate synthetic phrase break annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates synthetic phrase break annotations
Reduces manual annotation effort significantly
Effective across multiple language datasets
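To make the idea above concrete, here is a minimal sketch of what post-processing LLM-generated annotations might look like. It is not the paper's actual pipeline: the `|` break marker, the prompt wording, and the function names are assumptions for illustration. An LLM is prompted to insert pause markers into plain text, and the marked-up response is parsed into per-word break/no-break labels that a phrase break predictor could train on.

```python
# Illustrative sketch (assumed details, not the paper's method): a prompt
# asking an LLM to mark phrase breaks, plus a parser that converts the
# marker-annotated response into per-word training labels.

PROMPT_TEMPLATE = (
    "Insert '|' where a natural phrase break (pause) would occur "
    "when reading this sentence aloud:\n{sentence}"
)

def parse_break_annotation(annotated: str, marker: str = "|"):
    """Convert a marker-annotated sentence into (words, labels).

    labels[i] == 1 means a phrase break follows words[i].
    """
    words, labels = [], []
    for token in annotated.split():
        if token == marker:
            if labels:  # standalone marker: break after the previous word
                labels[-1] = 1
            continue
        if token.endswith(marker):  # marker glued to a word, e.g. "home|"
            words.append(token[: -len(marker)])
            labels.append(1)
        else:
            words.append(token)
            labels.append(0)
    return words, labels

# Example: a hypothetical LLM response for one input sentence.
llm_output = "When the sun set | the workers went home | exhausted"
words, labels = parse_break_annotation(llm_output)
print(words)   # the nine words without markers
print(labels)  # [0, 0, 0, 1, 0, 0, 0, 1, 0]
```

Synthetic pairs produced this way could then feed any standard sequence labeler, which is consistent with the paper's finding that models trained only on LLM-generated annotations approach those trained on human labels.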
Hoyeon Lee
NAVER Cloud, South Korea
Sejung Son
NHN, South Korea
Ye-Eun Kang
Yale University, USA
Jong-Hwan Kim
Professor of Electrical Engineering, KAIST
AI Robotics, Intelligence Technology, Machine Intelligence Learning