🤖 AI Summary
Manually constructing high-quality human-chatbot dialogue data is costly and inefficient, which hinders progress in task-oriented dialogue research. To address this, we propose DialogueForge: a framework that bootstraps from real user-system interactions and employs large language models (e.g., GPT-4o, Llama, Mistral) to simulate human users and generate multi-turn, task-oriented dialogues. Crucially, we validate empirically that small-scale open-source models can, after supervised fine-tuning, generate highly realistic and customizable dialogues. Evaluation under two protocols (UniEval and GTEval) shows that proprietary LLMs achieve the best performance, while fine-tuned lightweight open-source models substantially improve dialogue naturalness and task consistency. Long-range coherence remains a persistent challenge across all models. Our work establishes a cost-effective and scalable paradigm for synthetic dialogue data generation.
📝 Abstract
Collecting human-chatbot dialogues typically demands substantial manual effort and is time-consuming, which limits research on conversational AI. In this work, we propose DialogueForge, a framework for generating AI-simulated conversations in the style of human-chatbot interactions. To initialize each generated conversation, DialogueForge uses seed prompts extracted from real human-chatbot interactions. We test a variety of LLMs to simulate the human chatbot user, ranging from state-of-the-art proprietary models to small-scale open-source LLMs, and generate multi-turn dialogues tailored to specific tasks. In addition, we explore fine-tuning techniques to enhance the ability of smaller models to produce human-like dialogues that are difficult to distinguish from real ones. We evaluate the quality of the simulated conversations and compare models using the UniEval and GTEval evaluation protocols. Our experiments show that large proprietary models (e.g., GPT-4o) generally outperform others in generating more realistic dialogues, while smaller open-source models (e.g., Llama, Mistral) offer promising performance with greater customization. We demonstrate that the performance of smaller models can be significantly improved through supervised fine-tuning. Nevertheless, maintaining coherent and natural long-form, human-like dialogues remains a common challenge across all models.
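The generation loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`simulate_dialogue`, `toy_user`, `toy_chatbot`) and the stop condition are assumptions, and the stub models stand in for real LLM calls. It shows the core idea: a real seed prompt initializes the conversation, then a user-simulating LLM and a chatbot model alternate turns until the simulated user ends the task.

```python
# Hypothetical sketch of a DialogueForge-style loop (names and structure
# are illustrative assumptions, not the paper's code).

def simulate_dialogue(seed_prompt, user_llm, chatbot_llm, max_turns=4):
    """Alternate user/chatbot turns starting from a real seed prompt."""
    history = [("user", seed_prompt)]
    for _ in range(max_turns - 1):
        reply = chatbot_llm(history)       # chatbot answers the latest user turn
        history.append(("assistant", reply))
        follow_up = user_llm(history)      # an LLM plays the human user
        if follow_up is None:              # simulated user considers the task done
            break
        history.append(("user", follow_up))
    return history

# Stub models standing in for real LLM calls, so the sketch runs as-is.
def toy_chatbot(history):
    return f"Response to: {history[-1][1]}"

def toy_user(history):
    # End after the second user turn to keep the example short.
    user_turns = sum(1 for role, _ in history if role == "user")
    return "Can you give more detail?" if user_turns < 2 else None

dialogue = simulate_dialogue("Help me book a table for two.", toy_user, toy_chatbot)
# dialogue is a list of (role, text) pairs: user, assistant, user, assistant
```

In the paper's setting, `user_llm` would be a proprietary model such as GPT-4o or a fine-tuned small open-source model, and the seed prompt would come from a real logged interaction.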