🤖 AI Summary
The scarcity of high-quality, diverse dialogue data severely constrains the training and evaluation of dialogue AI systems.
Method: This paper proposes a dynamic few-shot centroid-driven multi-agent iterative generation framework. It constructs an evolvable few-shot prompt library and simulates authentic dialogue behaviors via collaborative multi-agent interaction, dynamically sampling, refining, and regenerating utterances under task guidance to jointly optimize semantic fidelity, intent diversity, and downstream task adaptability.
Contribution/Results: Compared with static prompting or single-turn generation paradigms, our approach significantly improves synthetic data quality coverage and distributional realism. Empirical evaluation on downstream tasks—including intent classification and dialogue summarization—demonstrates average performance gains of 3.2–5.8 percentage points. These results underscore the critical role of high-fidelity synthetic data generation across the full lifecycle of dialogue AI development, from pretraining to evaluation and fine-tuning.
📝 Abstract
In this paper, we present ConvoGen: an innovative framework for generating synthetic conversational data using multi-agent systems. Our method leverages few-shot learning and introduces iterative sampling from a dynamically updated few-shot hub to create diverse and realistic conversational scenarios. The generated data has numerous applications, including training and evaluating conversational AI models, and augmenting existing datasets for tasks like conversational intent classification or conversation summarization. Our experiments demonstrate the effectiveness of this method in producing high-quality diverse synthetic conversational data, highlighting its potential to enhance the development and evaluation of conversational AI systems.