🤖 AI Summary
Problem: Real conversational speech data for multi-speaker tasks such as audio tagging, audio classification, and speaker identification is scarce and expensive to annotate. Method: This paper proposes ConversaSynth, the first framework to combine multi-role large language model (LLM) dialogue generation with high-fidelity text-to-speech (TTS) synthesis. Using multi-role prompting, structured dialogue control, and topic-diverse sampling, it generates synthetic dialogue audio that is semantically coherent, speaker-discriminative, and acoustically natural. Results: Experiments show that models trained on ConversaSynth-generated datasets achieve substantially better downstream performance, with the synthetic data approaching real conversations in both diversity and perceptual realism, offering a practical path to high-quality synthetic data for low-resource multi-speaker speech tasks.
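The generation stage described above is text-first: an LLM is prompted with several personas and a sampled topic, and asked to return a structured multi-turn dialogue. The sketch below illustrates what that step might look like; the chat client, model name, personas, topic list, and prompt wording are all illustrative assumptions, since the summary does not name a specific LLM or API.

```python
# Minimal sketch of multi-role, topic-diverse dialogue generation.
# The OpenAI client, model name, personas, and prompt text are
# illustrative assumptions, not ConversaSynth's stated configuration.
import json
import random
from openai import OpenAI  # any chat-completion-style client would do

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONAS = [
    "a pragmatic software engineer who answers tersely",
    "a curious student who asks follow-up questions",
]
TOPICS = ["travel", "cooking", "personal finance", "machine learning"]

def generate_dialogue(n_turns: int = 8) -> list[dict]:
    """Sample a topic, then ask the LLM for a structured dialogue:
    a JSON list of {"speaker": <index>, "text": <utterance>} turns."""
    topic = random.choice(TOPICS)
    roles = "\n".join(f"Speaker {i}: {p}" for i, p in enumerate(PERSONAS))
    prompt = (
        f"Write a {n_turns}-turn conversation about {topic} between:\n"
        f"{roles}\n"
        'Return only JSON: [{"speaker": 0, "text": "..."}, ...]'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper may use another LLM
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON, as the prompt requests.
    return json.loads(resp.choices[0].message.content)
```

Keeping a fixed persona per speaker index is what makes the later audio speaker-discriminative: each index can then be mapped to one consistent TTS voice.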
📝 Abstract
In this paper, we introduce ConversaSynth, a framework designed to generate synthetic conversation audio using large language models (LLMs) with multiple persona settings. The framework first creates diverse and coherent text-based dialogues across various topics, which are then converted into audio using text-to-speech (TTS) systems. Our experiments demonstrate that ConversaSynth effectively generates high-quality synthetic audio datasets, which can significantly enhance the training and evaluation of models for audio tagging, audio classification, and multi-speaker speech recognition. The results indicate that the synthetic datasets generated by ConversaSynth exhibit substantial diversity and realism, making them suitable for developing robust, adaptable audio-based AI systems.
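The abstract describes a two-stage pipeline: generated text dialogues are converted to audio with TTS. A minimal sketch of that second stage is below, consuming the turn list from the generation sketch above; the TTS backend (Coqui TTS with its multi-speaker VCTK/VITS model), the speaker IDs, and the pause length are illustrative assumptions, not the paper's stated setup.

```python
# Sketch of rendering a generated text dialogue into one conversation WAV.
# The Coqui TTS model and VCTK speaker IDs are illustrative assumptions.
from pathlib import Path
from pydub import AudioSegment
from TTS.api import TTS

tts = TTS("tts_models/en/vctk/vits")  # multi-speaker English model
VOICES = ["p225", "p232"]  # one fixed VCTK voice per dialogue speaker

def render_dialogue(turns: list[dict], out_path: str = "dialogue.wav") -> None:
    """Synthesize each turn with its speaker's fixed voice, then
    concatenate the clips with short pauses between turns."""
    pause = AudioSegment.silent(duration=300)  # 300 ms between turns
    audio = AudioSegment.empty()
    for i, turn in enumerate(turns):
        clip_path = Path(f"turn_{i}.wav")
        tts.tts_to_file(
            text=turn["text"],
            speaker=VOICES[turn["speaker"] % len(VOICES)],
            file_path=str(clip_path),
        )
        audio += AudioSegment.from_wav(clip_path) + pause
    audio.export(out_path, format="wav")

# Example usage: render_dialogue(generate_dialogue(n_turns=8))
```

Because the speaker-to-voice mapping and turn boundaries are known at synthesis time, this setup yields free, exact labels for audio tagging, classification, and multi-speaker recognition training.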