A Framework for Synthetic Audio Conversations Generation Using Large Language Models

📅 2024-09-02
🏛️ 2024 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)
📈 Citations: 2
Influential: 0
🤖 AI Summary
Real conversational speech data for multi-speaker tasks, such as audio tagging, classification, and speaker identification, is scarce and expensive to annotate. Method: This paper proposes ConversaSynth, a framework that couples multi-persona large language model (LLM) dialogue generation with text-to-speech (TTS) synthesis. It uses persona-conditioned prompting and topic-diverse sampling to produce end-to-end synthetic dialogue audio that is semantically coherent, speaker-discriminative, and acoustically natural. Contribution/Results: Experiments show that the generated datasets exhibit substantial diversity and realism, and that they can enhance the training and evaluation of models for audio tagging, audio classification, and multi-speaker speech recognition, making them useful for low-resource multi-speaker speech tasks.

📝 Abstract
In this paper, we introduce ConversaSynth, a framework designed to generate synthetic conversation audio using large language models (LLMs) with multiple persona settings. The framework first creates diverse and coherent text-based dialogues across various topics, which are then converted into audio using text-to-speech (TTS) systems. Our experiments demonstrate that ConversaSynth effectively generates high-quality synthetic audio datasets, which can significantly enhance the training and evaluation of models for audio tagging, audio classification, and multi-speaker speech recognition. The results indicate that the synthetic datasets generated by ConversaSynth exhibit substantial diversity and realism, making them suitable for developing robust, adaptable audio-based AI systems.
Problem

Research questions and friction points this paper is trying to address.

Generating synthetic audio conversations using LLMs
Enhancing audio tagging and classification models
Creating diverse datasets for multi-speaker speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based diverse dialogue generation
Text-to-speech conversion for audio
High-quality synthetic dataset creation
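The two-stage pipeline described above (LLM-generated multi-persona dialogue, then per-speaker TTS rendering) can be sketched roughly as below. This is an illustrative outline only, not the authors' implementation: the function names, persona handling, and the stubbed LLM/TTS calls are assumptions, and a real system would replace the placeholders with actual model and TTS-engine calls.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str  # persona name
    text: str     # utterance for this turn


def generate_dialogue(topic: str, personas: list[str], turns_per_persona: int = 2) -> list[Turn]:
    """Placeholder for stage 1: an LLM prompted with multiple persona
    descriptions, asked to produce a coherent dialogue on `topic`.
    Here we just fabricate alternating turns to show the data flow."""
    speakers = personas * turns_per_persona
    return [Turn(speaker=s, text=f"[{s}'s remark on {topic}]") for s in speakers]


def synthesize(turns: list[Turn], voices: dict[str, str]) -> list[tuple[str, str]]:
    """Placeholder for stage 2: each speaker is mapped to a distinct TTS
    voice so the rendered audio is speaker-discriminative. A real system
    would return waveforms; here we return (voice_id, text) segments."""
    return [(voices[t.speaker], t.text) for t in turns]


# Hypothetical personas and voice assignments (not from the paper).
personas = ["Alice", "Bob"]
voices = {"Alice": "voice_female_1", "Bob": "voice_male_1"}

turns = generate_dialogue("renewable energy", personas)
segments = synthesize(turns, voices)
```

Concatenating the rendered segments (with silence or overlap between turns) would yield one synthetic conversation clip, labeled by construction with speaker identities and turn boundaries, which is what makes such data convenient for audio tagging and speaker-recognition training.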
Kaung Myat Kyaw
Innovative Cognitive Computing (IC2) Research Center, School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
Jonathan Hoyin Chan
Innovative Cognitive Computing (IC2) Research Center, School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand