ConvergeWriter: Data-Driven Bottom-Up Article Construction

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle to generate long, factually grounded documents because conventional top-down approaches decouple planning from the available external knowledge, which hampers knowledge integration and invites hallucination. This paper proposes a data-driven bottom-up framework: it first iteratively retrieves external knowledge, then applies unsupervised document clustering to form semantically coherent knowledge clusters; from these clusters it constructs a hierarchical outline and constrains the LLM to generate content cluster by cluster. By anchoring document structure to empirically derived knowledge boundaries, the method ensures factual traceability and textual coherence. Evaluated on 14B- and 32B-parameter models, the approach matches or surpasses state-of-the-art baselines and is expected to offer particular advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence.

📝 Abstract
Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing "top-down" methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model's plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel "bottom-up," data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a "Retrieval-First for Knowledge, Clustering for Structure" strategy, which first establishes the "knowledge boundaries" of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct "knowledge clusters." These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.
Problem

Research questions and friction points this paper is trying to address.

Generating long-form factual documents from knowledge bases
Overcoming content fragmentation and factual inaccuracies in LLMs
Mitigating hallucination risks in knowledge-constrained text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-first strategy for knowledge boundaries
Unsupervised clustering for document organization
Bottom-up generation ensuring content traceability
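The three innovations above form a pipeline: retrieve first, cluster the retrieved documents into knowledge clusters, then derive the outline from those clusters. A minimal, purely illustrative sketch follows; everything here is an assumption for illustration (a toy in-memory corpus in place of iterative retrieval, greedy bag-of-words cosine clustering in place of the paper's unsupervised algorithm, and a word-frequency heuristic in place of LLM-generated section titles):

```python
# Illustrative sketch of the bottom-up "retrieve -> cluster -> outline" idea.
# None of these names or heuristics come from the paper's implementation.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_documents(docs, threshold=0.3):
    """Greedy single-pass clustering (stand-in for the paper's unsupervised
    algorithm): a doc joins the first cluster whose seed doc is similar enough,
    otherwise it starts a new cluster. Returns lists of doc indices."""
    vecs = [Counter(d.lower().split()) for d in docs]
    clusters = []
    for i, v in enumerate(vecs):
        for c in clusters:
            if cosine(vecs[c[0]], v) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def build_outline(docs, clusters):
    """One outline section per knowledge cluster; the title is the most
    frequent content word in the cluster (a stand-in for an LLM call).
    Each section carries its source docs, keeping content traceable."""
    outline = []
    for c in clusters:
        words = Counter()
        for i in c:
            words.update(w for w in docs[i].lower().split() if len(w) > 3)
        title, _ = words.most_common(1)[0]
        outline.append((title, [docs[i] for i in c]))
    return outline

# Toy corpus standing in for iteratively retrieved documents.
docs = [
    "solar panels convert sunlight into electricity",
    "solar energy output depends on sunlight hours",
    "wind turbines generate power from moving air",
]
clusters = cluster_documents(docs)
outline = build_outline(docs, clusters)
for title, members in outline:
    print(title, len(members))
```

The point of the sketch is the inversion: the outline is computed *from* the retrieved clusters rather than drafted first, so every section is traceable to specific source documents by construction.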
Binquan Ji, School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
Jiaqi Wang, School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
Ruiting Li, School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
Xingchen Han, School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
Yiyang Qi, School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
Shichao Wang, School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
Yifei Lu, Northeastern University, Shenyang, China
Yuantao Han, School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
Feiliang Ren, Northeastern University