🤖 AI Summary
Problem: Real conversational speech data for multi-speaker tasks such as audio tagging, audio classification, and speaker identification is scarce and expensive to annotate. Method: This paper proposes ConversaSynth, the first framework to combine multi-role large language model (LLM) dialogue generation with high-fidelity text-to-speech (TTS) synthesis. Using multi-role prompting, structured dialogue control, and topic-diverse sampling, it generates synthetic dialogue audio that is semantically coherent, speaker-discriminative, and acoustically natural. Results: Experiments show that models trained on ConversaSynth-generated datasets achieve substantially better downstream performance, with the synthetic data approaching real conversations in both diversity and perceptual realism, offering a practical path to high-quality synthetic data for low-resource multi-speaker speech tasks.
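The generation stage described above is text-first: an LLM is prompted with several personas and a sampled topic, and asked to return a structured multi-turn dialogue. The sketch below illustrates what that step might look like; the chat client, model name, personas, topic list, and prompt wording are all illustrative assumptions, since the summary does not name a specific LLM or API.

```python
# Minimal sketch of multi-role, topic-diverse dialogue generation.
# The OpenAI client, model name, personas, and prompt text are
# illustrative assumptions, not ConversaSynth's stated configuration.
import json
import random
from openai import OpenAI  # any chat-completion-style client would do

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONAS = [
    "a pragmatic software engineer who answers tersely",
    "a curious student who asks follow-up questions",
]
TOPICS = ["travel", "cooking", "personal finance", "machine learning"]

def generate_dialogue(n_turns: int = 8) -> list[dict]:
    """Sample a topic, then ask the LLM for a structured dialogue:
    a JSON list of {"speaker": <index>, "text": <utterance>} turns."""
    topic = random.choice(TOPICS)
    roles = "\n".join(f"Speaker {i}: {p}" for i, p in enumerate(PERSONAS))
    prompt = (
        f"Write a {n_turns}-turn conversation about {topic} between:\n"
        f"{roles}\n"
        'Return only JSON: [{"speaker": 0, "text": "..."}, ...]'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper may use another LLM
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON, as the prompt requests.
    return json.loads(resp.choices[0].message.content)
```

Keeping a fixed persona per speaker index is what makes the later audio speaker-discriminative: each index can then be mapped to one consistent TTS voice.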
📝 Abstract
In this paper, we introduce ConversaSynth, a framework designed to generate synthetic conversation audio using large language models (LLMs) with multiple persona settings. The framework first creates diverse and coherent text-based dialogues across various topics, which are then converted into audio using text-to-speech (TTS) systems. Our experiments demonstrate that ConversaSynth effectively generates high-quality synthetic audio datasets, which can significantly enhance the training and evaluation of models for audio tagging, audio classification, and multi-speaker speech recognition. The results indicate that the synthetic datasets generated by ConversaSynth exhibit substantial diversity and realism, making them suitable for developing robust, adaptable audio-based AI systems.
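The abstract describes a two-stage pipeline: generated text dialogues are converted to audio with TTS. A minimal sketch of that second stage is below, consuming the turn list from the generation sketch above; the TTS backend (Coqui TTS with its multi-speaker VCTK/VITS model), the speaker IDs, and the pause length are illustrative assumptions, not the paper's stated setup.

```python
# Sketch of rendering a generated text dialogue into one conversation WAV.
# The Coqui TTS model and VCTK speaker IDs are illustrative assumptions.
from pathlib import Path
from pydub import AudioSegment
from TTS.api import TTS

tts = TTS("tts_models/en/vctk/vits")  # multi-speaker English model
VOICES = ["p225", "p232"]  # one fixed VCTK voice per dialogue speaker

def render_dialogue(turns: list[dict], out_path: str = "dialogue.wav") -> None:
    """Synthesize each turn with its speaker's fixed voice, then
    concatenate the clips with short pauses between turns."""
    pause = AudioSegment.silent(duration=300)  # 300 ms between turns
    audio = AudioSegment.empty()
    for i, turn in enumerate(turns):
        clip_path = Path(f"turn_{i}.wav")
        tts.tts_to_file(
            text=turn["text"],
            speaker=VOICES[turn["speaker"] % len(VOICES)],
            file_path=str(clip_path),
        )
        audio += AudioSegment.from_wav(clip_path) + pause
    audio.export(out_path, format="wav")

# Example usage: render_dialogue(generate_dialogue(n_turns=8))
```

Because the speaker-to-voice mapping and turn boundaries are known at synthesis time, this setup yields free, exact labels for audio tagging, classification, and multi-speaker recognition training.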