🤖 AI Summary
To address efficiency bottlenecks in Retrieval-Augmented Generation (RAG) over tree-structured data, this paper proposes a bottom-up tree linearization method that hierarchically aggregates node representations into implicit level-wise summaries, compressing the original tree into a compact linear sequence. The approach combines implicit knowledge modeling with in-context learning, substantially reducing RAG's dependence on retrieving raw external documents. Experiments show that the method retrieves over 68% fewer documents than conventional RAG while preserving response quality, significantly improving efficiency and scalability for deep hierarchical data. The core contribution is the first incorporation of an implicit semantic aggregation mechanism, native to tree structures, into the linearization process, thereby jointly optimizing knowledge density and reasoning efficiency.
📝 Abstract
Large Language Models (LLMs) are adept at generating responses from information in their context, and Retrieval-Augmented Generation (RAG), a popular method, exploits this by retrieving relevant documents to augment the model's in-context learning. However, it remains under-explored how best to represent retrieved knowledge when generating responses over structured data, particularly hierarchical structures such as trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (such as a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. The linearized knowledge can be stored in a knowledge base and used directly with RAG. We then compare our method against RAG over raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach retrieves over 68% fewer documents, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.
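The bottom-up linearization described above can be sketched as a post-order traversal: each leaf (e.g., a source file) is summarized from its raw content, and each internal node (e.g., a directory) aggregates the implicit summaries of its children, emitting one compact document per node. The sketch below is a minimal illustration under assumptions, not the paper's implementation: the `Node` class and `summarize` function are hypothetical stand-ins (a real system would call an LLM summarizer where `summarize` simply joins and truncates text).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    content: str = ""                       # leaf payload, e.g. a file's text
    children: list = field(default_factory=list)

def summarize(texts, level):
    # Hypothetical stand-in for an LLM-based summarizer; joining and
    # truncating keeps the sketch self-contained and runnable.
    return f"[L{level}] " + " | ".join(texts)[:80]

def linearize(node, out, level=0):
    """Post-order (bottom-up) traversal: children are summarized before
    their parent, so each level's summary implicitly aggregates the level
    below. Appends one compact document per node to `out`."""
    if node.children:
        child_summaries = [linearize(c, out, level + 1) for c in node.children]
        summary = summarize(child_summaries, level)
    else:
        summary = summarize([node.content], level)
    out.append(f"{node.name}: {summary}")
    return summary
```

A small usage example, assuming a toy repository tree: calling `linearize(repo, docs)` fills `docs` with one document per node, deepest nodes first, and that linear sequence can then be indexed in a RAG knowledge base in place of the raw files.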