Bottom-Up Synthesis of Knowledge-Grounded Task-Oriented Dialogues with Iteratively Self-Refined Prompts

📅 2025-04-19

📈 Citations: 0

✨ Influential: 0

career value

143K/year

🤖 AI Summary

Task-oriented dialogue data is scarce, and existing top-down synthetic approaches suffer from poor controllability and high hallucination rates. To address this, we propose a bottom-up, two-stage synthesis paradigm: first generating knowledge-grounded question-answer (QA) pairs under strict knowledge constraints, then assembling them into multi-turn dialogues via consistency modeling. Our method introduces three key innovations: (1) a decoupled, stepwise architecture that separates content generation from structural orchestration; (2) a privacy-preserving design enabling non-local large language models to participate only in non-sensitive stages; and (3) an iterative prompt refinement mechanism to suppress hallucinations. Experiments demonstrate that our synthesized dialogues significantly outperform end-to-end baselines in factual fidelity, knowledge accuracy, and task coherence. Human evaluation shows a 23% improvement in overall quality score and a 41% reduction in hallucination rate.

Technology Category

Application Category

📝 Abstract

Training conversational question-answering (QA) systems requires a substantial amount of in-domain data, which is often scarce in practice. A common solution to this challenge is to generate synthetic data. Traditional methods typically follow a top-down approach, where a large language model (LLM) generates multi-turn dialogues from a broad prompt. Although this method produces coherent conversations, it offers limited fine-grained control over the content and is susceptible to hallucinations. We introduce a bottom-up conversation synthesis approach, where QA pairs are generated first and then combined into a coherent dialogue. This method offers greater control and precision by dividing the process into two distinct steps, allowing refined instructions and validations to be handled separately. Additionally, this structure allows the use of non-local models in stages that do not involve proprietary knowledge, enhancing the overall quality of the generated data. Both human and automated evaluations demonstrate that our approach produces more realistic and higher-quality dialogues compared to top-down methods.

Problem

Research questions and friction points this paper is trying to address.

Lack of in-domain data for training QA systems

Limited control and hallucinations in top-down dialogue generation

Need for precise, high-quality synthetic dialogue data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bottom-up synthesis for QA pairs first

Iteratively refined prompts enhance precision

Non-local models improve data quality

🔎 Similar Papers

No similar papers found.