BookRAG: A Hierarchical Structure-aware Index-based Approach for Retrieval-Augmented Generation on Complex Documents

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RAG methods struggle to retrieve effectively from complex, hierarchically structured documents such as books because they neglect document hierarchy and the semantic dependencies between entities, which lowers retrieval precision. This paper proposes BookRAG, a retrieval-augmented generation framework designed for hierarchical documents. Its core contribution is an index, BookIndex, that jointly models document hierarchy and entity relationships: a hierarchical tree extracted from the document captures structural granularity, an entity-relation graph encodes semantic associations, and entities are mapped to tree nodes. On top of this index, an agent-based query strategy grounded in Information Foraging Theory classifies each query and applies a tailored retrieval workflow to locate relevant content at the appropriate granularity. Evaluated on three established benchmarks, BookRAG outperforms state-of-the-art baselines, with a reported 12.6–18.3% higher retrieval recall and 9.4–15.7% higher question-answering accuracy, while keeping inference efficient.

📝 Abstract
Retrieval-Augmented Generation (RAG), which retrieves highly relevant information from external complex documents, is an effective way to boost the performance of Large Language Models (LLMs) on the question answering (QA) task and has attracted tremendous attention from both industry and academia. Existing RAG approaches often focus on general documents and overlook the fact that many real-world documents (such as books, booklets, and handbooks) have a hierarchical structure that organizes their content at different levels of granularity; as a result, they perform poorly on QA over such documents. To address this limitation, we introduce BookRAG, a novel RAG approach for documents with a hierarchical structure, which exploits logical hierarchies and traces entity relations to retrieve the most relevant information. Specifically, we build a novel index structure, called BookIndex, by extracting a hierarchical tree from the document that serves as its table of contents, using a graph to capture the intricate relationships between entities, and mapping entities to tree nodes. Leveraging the BookIndex, we then propose an agent-based query method inspired by Information Foraging Theory, which dynamically classifies queries and employs a tailored retrieval workflow for each class. Extensive experiments on three widely adopted benchmarks demonstrate that BookRAG achieves state-of-the-art performance, significantly outperforming baselines in both retrieval recall and QA accuracy while maintaining competitive efficiency.
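
The three parts of the BookIndex described above (a hierarchical tree, an entity graph, and a mapping from entities to tree nodes) can be pictured with a small data structure. This is only a minimal sketch under stated assumptions: the class names, fields, and the `nodes_for` lookup are hypothetical and are not taken from the paper, and the graph is kept undirected and unlabeled for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TreeNode:
    """One node of the hierarchical tree, e.g. a chapter or section (hypothetical)."""
    node_id: str
    title: str
    children: list[TreeNode] = field(default_factory=list)


@dataclass
class BookIndex:
    """Hypothetical container: tree + entity graph + entity-to-node mapping."""
    root: TreeNode
    # entity -> set of directly related entities
    graph: dict[str, set[str]] = field(default_factory=dict)
    # entity -> ids of the tree nodes that mention it
    entity_nodes: dict[str, set[str]] = field(default_factory=dict)

    def add_relation(self, a: str, b: str) -> None:
        """Record an undirected relation between two entities."""
        self.graph.setdefault(a, set()).add(b)
        self.graph.setdefault(b, set()).add(a)

    def link(self, entity: str, node_id: str) -> None:
        """Map an entity to a tree node that mentions it."""
        self.entity_nodes.setdefault(entity, set()).add(node_id)

    def nodes_for(self, entity: str) -> set[str]:
        """Ids of tree nodes mentioning the entity or one of its direct neighbors."""
        related = {entity} | self.graph.get(entity, set())
        found: set[str] = set()
        for name in related:
            found |= self.entity_nodes.get(name, set())
        return found


# Toy document: a book with two chapters and two related entities.
root = TreeNode("0", "Book", [TreeNode("1", "Chapter 1"), TreeNode("2", "Chapter 2")])
index = BookIndex(root)
index.link("tree", "1")
index.link("graph", "2")
index.add_relation("tree", "graph")
```

In this toy setup, looking up one entity also surfaces the section that mentions its related entity, which is the kind of cross-section tracing that the entity-to-node mapping is meant to support.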

Problem

Research questions and friction points this paper is trying to address.

Addresses poor QA performance on hierarchically structured documents.
Proposes BookRAG to exploit logical hierarchies and entity relations.
Enhances retrieval recall and QA accuracy via a novel index structure.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical tree index for structured documents
Graph-based entity relation mapping
Agent-based dynamic query classification
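
The idea of classifying a query and then dispatching it to a tailored retrieval workflow can be sketched as a simple router. Everything here is an assumption for illustration: the page does not name the query classes, so `single_hop`, `multi_hop`, and `global` are hypothetical labels, the keyword rule merely stands in for an LLM-based agent, and the workflow functions return descriptive strings rather than performing retrieval.

```python
from typing import Callable


def classify(query: str) -> str:
    """Stand-in classifier: a crude keyword rule, not the paper's agent."""
    q = query.lower()
    if any(word in q for word in ("summarize", "overview", "overall")):
        return "global"      # hypothetical class: whole-document questions
    if any(word in q for word in ("compare", "relationship", "between")):
        return "multi_hop"   # hypothetical class: questions spanning entities
    return "single_hop"      # hypothetical class: one localized fact


def retrieve_single_hop(query: str) -> str:
    return f"look up the best-matching tree node for: {query}"


def retrieve_multi_hop(query: str) -> str:
    return f"follow entity relations across tree nodes for: {query}"


def retrieve_global(query: str) -> str:
    return f"aggregate high-level tree nodes for: {query}"


# One workflow per (hypothetical) query class.
WORKFLOWS: dict[str, Callable[[str], str]] = {
    "single_hop": retrieve_single_hop,
    "multi_hop": retrieve_multi_hop,
    "global": retrieve_global,
}


def plan(query: str) -> tuple[str, str]:
    """Classify the query, then run the workflow registered for its class."""
    kind = classify(query)
    return kind, WORKFLOWS[kind](query)
```

Keeping the classifier and the workflows in separate functions means either side can be swapped out (for example, replacing the keyword rule with a model call) without touching the dispatch table.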