Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation

📅 2024-06-19
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
🤖 AI Summary
Existing RAG systems struggle with "how-to" questions due to logical fragmentation and step disconnection, primarily caused by conventional fixed-size chunking that disrupts the semantic coherence of procedural knowledge. To address this, we propose Thread, a novel data organization paradigm grounded in logically cohesive, semantically self-contained units. Leveraging large language models, documents are parsed and reconstructed into loosely coupled reasoning threads, enabling fine-grained, process-aware knowledge representation. Our method integrates logic-driven document segmentation, cross-format adaptive indexing, and retrieval-augmented generation. Experiments across open-domain and industrial benchmarks demonstrate a 21% to 33% improvement in "how-to" question resolution success rate, a 75% reduction in candidate knowledge volume, and substantial gains in retrieval efficiency and answer executability.

📝 Abstract
Recent advances in retrieval-augmented generation have significantly improved the performance of question-answering systems, particularly on factoid '5Ws' questions. However, these systems still face substantial challenges when addressing '1H' questions, specifically how-to questions, which are integral to decision-making processes and require dynamic, step-by-step answers. The key limitation lies in the prevalent data organization paradigm, chunk, which divides documents into fixed-size segments, and disrupts the logical coherence and connections within the context. To overcome this, in this paper, we propose Thread, a novel data organization paradigm aimed at enabling current systems to handle how-to questions more effectively. Specifically, we introduce a new knowledge granularity, termed 'logic unit', where documents are transformed into more structured and loosely interconnected logic units with large language models. Extensive experiments conducted across both open-domain and industrial settings demonstrate that Thread outperforms existing paradigms significantly, improving the success rate of handling how-to questions by 21% to 33%. Moreover, Thread exhibits high adaptability in processing various document formats, drastically reducing the candidate quantity in the knowledge base and minimizing the required information to one-fourth compared with chunk, optimizing both efficiency and effectiveness.

Problem

Research questions and friction points this paper is trying to address.

Addressing logical coherence loss in how-to question answering
Improving step-by-step reasoning for dynamic decision-making processes
Enhancing retrieval efficiency for complex procedural information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Thread organizes data into logic units for coherence
Logic units transform documents into structured interconnected segments
Thread reduces retrieval information by up to 75%
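
The contrast between fixed-size chunks and linked logic units can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the `LogicUnit` fields, the `follow_thread` helper, and the router troubleshooting text are all hypothetical names and data chosen for the example, and the real system builds its units with large language models rather than by hand.

```python
from dataclasses import dataclass, field


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Baseline: cut the text every `size` characters, ignoring step boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


@dataclass
class LogicUnit:
    """Hypothetical logic unit: one self-contained step plus links to later steps."""
    header: str   # short label used to look the unit up
    body: str     # the instruction for this step
    # Maps an observed outcome to the header of the unit to visit next.
    next_steps: dict[str, str] = field(default_factory=dict)


def follow_thread(units: dict[str, LogicUnit], start: str, outcomes: list[str]) -> list[str]:
    """Walk from `start`, following the link that matches each observed outcome."""
    path = [start]
    current = units[start]
    for outcome in outcomes:
        nxt = current.next_steps.get(outcome)
        if nxt is None:  # no link for this outcome: the thread ends here
            break
        path.append(nxt)
        current = units[nxt]
    return path


# Hand-written units for a made-up troubleshooting procedure (assumption).
units = {
    "check power": LogicUnit(
        "check power",
        "Confirm the router LED is on.",
        {"led off": "replace adapter", "led on": "restart router"},
    ),
    "replace adapter": LogicUnit("replace adapter", "Swap in a known-good power adapter."),
    "restart router": LogicUnit("restart router", "Hold the reset button for 10 seconds."),
}

doc = " ".join(u.body for u in units.values())
chunks = fixed_size_chunks(doc, 40)   # chunk boundaries fall mid-sentence
path = follow_thread(units, "check power", ["led on"])

print(chunks)
print(path)
```

In this toy, the first fixed-size chunk ends partway through the second step, while the linked units keep each step whole and let the next step depend on what the user observes.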