DialogueForge: LLM Simulation of Human-Chatbot Dialogue

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual construction of high-quality human-machine dialogue data is costly and inefficient, which hinders progress in task-oriented dialogue research. To address this, we propose DialogueForge: a framework that bootstraps from real user-system interactions and employs large language models (e.g., GPT-4o, Llama, Mistral) to simulate human users and generate multi-turn, task-oriented dialogues. Crucially, we show empirically that small open-source models can, after supervised fine-tuning, generate realistic and highly customizable dialogues. Evaluation under two protocols (UniEval and GTEval) shows that proprietary LLMs perform best overall, while fine-tuned lightweight open-source models substantially improve in dialogue naturalness and task consistency. Long-range coherence remains a persistent challenge for all models. This work establishes a cost-effective, scalable paradigm for synthetic dialogue data generation.

📝 Abstract
Collecting human-chatbot dialogues typically demands substantial manual effort and is time-consuming, which limits research on conversational AI. In this work, we propose DialogueForge, a framework for generating AI-simulated conversations in human-chatbot style. To initialize each generated conversation, DialogueForge uses seed prompts extracted from real human-chatbot interactions. We test a variety of LLMs to simulate the human chatbot user, ranging from state-of-the-art proprietary models to small-scale open-source LLMs, and generate multi-turn dialogues tailored to specific tasks. In addition, we explore fine-tuning techniques to enhance the ability of smaller models to produce dialogues that are indistinguishable from human ones. We evaluate the quality of the simulated conversations and compare models using the UniEval and GTEval evaluation protocols. Our experiments show that large proprietary models (e.g., GPT-4o) generally generate the most realistic dialogues, while smaller open-source models (e.g., Llama, Mistral) offer promising performance with greater customization. We demonstrate that the performance of smaller models can be significantly improved by supervised fine-tuning. Nevertheless, maintaining coherent and natural long-form human-like dialogue remains a common challenge across all models.
Problem

Research questions and friction points this paper is trying to address.

Automating human-chatbot dialogue collection to reduce manual effort
Enhancing small LLMs to generate realistic human-like conversations
Evaluating dialogue quality across different LLM sizes and types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses seed prompts from real human-chatbot interactions
Tests various LLMs for simulating human-chatbot dialogues
Employs fine-tuning to enhance smaller models' performance
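The generation loop these bullets describe, seeding a conversation from a real interaction and then alternating a user-simulator LLM with the target chatbot, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_user_llm` and `query_chatbot` are hypothetical stand-ins for the actual model calls (e.g., GPT-4o or a fine-tuned Llama playing the human user).

```python
# Minimal sketch of a DialogueForge-style simulation loop (assumed structure,
# not the paper's code). The two query_* functions below are hypothetical
# placeholders for real LLM calls.

def query_user_llm(history, seed_prompt):
    # Placeholder: a real implementation would prompt an LLM to continue the
    # conversation in the role of the human user, conditioned on the seed task.
    return f"user turn {len(history) // 2 + 1} about: {seed_prompt}"

def query_chatbot(history):
    # Placeholder: the task-oriented chatbot whose behavior is being simulated.
    return f"bot reply to: {history[-1]['text']}"

def simulate_dialogue(seed_prompt, num_turns=3):
    """Generate a multi-turn dialogue from a seed prompt extracted
    from a real human-chatbot interaction."""
    history = []
    for _ in range(num_turns):
        # The user-simulator LLM speaks first, then the chatbot responds.
        history.append({"role": "user", "text": query_user_llm(history, seed_prompt)})
        history.append({"role": "assistant", "text": query_chatbot(history)})
    return history

dialogue = simulate_dialogue("book a table for two", num_turns=2)
for turn in dialogue:
    print(f"{turn['role']}: {turn['text']}")
```

In this framing, fine-tuning only touches the user-simulator model; the chatbot side and the loop itself stay fixed, which is what makes smaller open-source user simulators a drop-in, customizable replacement for proprietary ones.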
👥 Authors
Ruizhe Zhu
ETH Zurich
Hao Zhu
ETH Zurich
Yaxuan Li
ETH Zurich
Syang Zhou
Calvin Risk AG
Shijing Cai
Calvin Risk AG
Malgorzata Lazuka
Calvin Risk AG
Elliott Ash
Associate Professor of Law, Economics, and Data Science
Law and Economics · Political Economy · Text as Data · Large Language Models