🤖 AI Summary
Existing agentic RAG approaches improve LLM reliability via reinforcement learning but incur substantial token overhead during retrieval and reasoning, trading efficiency for accuracy. This paper proposes TeaRAG, a token-efficient agentic RAG framework that addresses this trade-off. First, it introduces a retrieval-compression mechanism that integrates a knowledge-association graph with Personalized PageRank, enabling joint semantic chunk retrieval, graph-structured triplet retrieval, and knowledge matching. Second, it proposes Iterative Process-aware Direct Preference Optimization (IP-DPO), which explicitly models and penalizes excessive reasoning steps. Evaluated on six benchmarks, TeaRAG improves average Exact Match by 4% (Llama3-8B-Instruct) and 2% (Qwen2.5-14B-Instruct) while reducing output tokens by 61% and 59%, respectively, demonstrating gains in both accuracy and generation efficiency.
📝 Abstract
Retrieval-Augmented Generation (RAG) utilizes external knowledge to augment the reliability of Large Language Models (LLMs). For flexibility, agentic RAG employs autonomous, multi-round retrieval and reasoning to resolve queries. Although recent agentic RAG methods have improved via reinforcement learning, they often incur substantial token overhead from search and reasoning processes, prioritizing accuracy over efficiency. To address this issue, this work proposes TeaRAG, a token-efficient agentic RAG framework that compresses both retrieval content and reasoning steps. 1) The retrieved content is compressed by augmenting chunk-based semantic retrieval with graph retrieval over concise triplets. A knowledge-association graph is built from semantic similarity and co-occurrence, and Personalized PageRank is leveraged to highlight key knowledge within this graph, reducing the number of tokens per retrieval. 2) To reduce reasoning steps, Iterative Process-aware Direct Preference Optimization (IP-DPO) is proposed. Specifically, the reward function evaluates knowledge sufficiency via a knowledge-matching mechanism while penalizing excessive reasoning steps. This design produces high-quality preference-pair datasets, supporting iterative DPO that improves reasoning conciseness. Across six datasets, TeaRAG improves average Exact Match by 4% and 2% while reducing output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively. Code is available at https://github.com/Applied-Machine-Learning-Lab/TeaRAG.
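To make the retrieval-compression step concrete, here is a minimal sketch of ranking candidate knowledge triplets with Personalized PageRank over a knowledge-association graph and keeping only the top-k. The toy graph, edge weights, seed selection, and `personalized_pagerank` helper are illustrative assumptions, not TeaRAG's actual implementation.

```python
def personalized_pagerank(edges, personalization, alpha=0.85, iters=50):
    """Power-iteration Personalized PageRank on a weighted graph.

    edges: dict mapping node -> {neighbor: edge weight}
    personalization: dict mapping node -> teleport mass (normalized below)
    """
    nodes = list(edges)
    total = sum(personalization.values())
    p = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(p)
    for _ in range(iters):
        new = {n: (1 - alpha) * p[n] for n in nodes}
        for n in nodes:
            out = sum(edges[n].values())
            if out == 0:
                # Dangling node: redistribute its mass via the teleport vector.
                for m in nodes:
                    new[m] += alpha * rank[n] * p[m]
            else:
                for m, w in edges[n].items():
                    new[m] += alpha * rank[n] * w / out
        rank = new
    return rank

# Toy knowledge-association graph: nodes are triplets; edges link triplets
# that are semantically similar or co-occur in the same retrieved chunk
# (weights here are made up for illustration).
t = [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "EU"),
    ("Eiffel Tower", "located_in", "Paris"),
    ("Berlin", "capital_of", "Germany"),
]
edges = {
    t[0]: {t[1]: 0.7, t[2]: 0.9},
    t[1]: {t[0]: 0.7, t[3]: 0.3},
    t[2]: {t[0]: 0.9},
    t[3]: {t[1]: 0.3},
}
# Seed the teleport mass on triplets matched to the query
# (e.g. "What is the capital of France?").
scores = personalized_pagerank(edges, {t[0]: 1.0})
# Keep only the top-k triplets, shrinking the context passed to the LLM.
top_k = sorted(scores, key=scores.get, reverse=True)[:2]
```

Because the query-matched triplet anchors the teleport distribution, triplets strongly associated with it score highest, so the compressed context retains the most query-relevant knowledge while discarding the rest.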