DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities

๐Ÿ“… 2025-07-08
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Large language models (LLMs) are typically pretrained on monolithic prose, creating a structural mismatch with multi-turn dialogue tasks. To address this, the authors propose a document-graph-based framework for synthesizing multi-turn dialogues: Wikipedia articles are first clustered and semantically modeled to construct a document association graph; graph traversal then generates cross-topic, long-horizon information-seeking dialogue paths, with logical coherence enforced via prompt engineering and rule-based constraints. Using this method, they construct a pretraining corpus of over 730,000 multi-turn dialogues, a large-scale, structured, long-horizon dialogue dataset generated automatically from non-dialogue text. Incorporating this corpus into LLM pretraining improves context memory and understanding by up to 40%, strengthens multi-turn dialogue performance, and preserves core language modeling capabilities.
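The core idea (document graph construction followed by traversal to produce cross-topic dialogue paths) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a bag-of-words cosine similarity in place of the paper's semantic modeling, and a simple greedy walk in place of its traversal strategy; the names `build_doc_graph` and `dialogue_path` are invented for this sketch.

```python
import itertools
import math
from collections import Counter


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def build_doc_graph(docs, threshold=0.2):
    # docs: {title: text}. Edges connect documents whose similarity
    # exceeds the threshold, forming a document association graph.
    vecs = {t: Counter(text.lower().split()) for t, text in docs.items()}
    graph = {t: [] for t in docs}
    for a, b in itertools.combinations(docs, 2):
        if cosine(vecs[a], vecs[b]) >= threshold:
            graph[a].append(b)
            graph[b].append(a)
    return graph


def dialogue_path(graph, start, max_turns=4):
    # Greedy traversal: each hop shifts the conversation to a related,
    # not-yet-visited topic, yielding a cross-topic dialogue outline.
    path, current, visited = [start], start, {start}
    while len(path) < max_turns:
        candidates = [n for n in graph[current] if n not in visited]
        if not candidates:
            break
        current = candidates[0]
        visited.add(current)
        path.append(current)
    return path
```

In the full pipeline, each node on the path would seed one or more question-answer turns grounded in that document, with prompts and rule-based constraints keeping the transitions coherent.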

๐Ÿ“ Abstract
Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at https://huggingface.co/datasets/AmazonScience/DocTalk.
Problem

Research questions and friction points this paper is trying to address.

Addresses mismatch between LLM training data and conversational needs
Synthesizes multi-turn dialogues from text corpora for LLM pre-training
Enhances LLM context memory and understanding via structured conversations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based dialogue synthesis from text corpora
Transforms document clusters into multi-turn dialogues
Improves context memory and understanding by up to 40%
๐Ÿ”Ž Similar Papers
No similar papers found.