Test-Time Strategies for More Efficient and Accurate Agentic RAG

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency and low accuracy of agent-based RAG systems in multi-hop question answering, which stem from redundant retrieval and insufficient context integration. Without modifying the training procedure, the authors propose a test-time optimization strategy built upon the Search-R1 framework. Their approach introduces an LLM-driven contextualization module to enhance the fusion of retrieved information and incorporates a deduplication mechanism that dynamically replaces redundant documents. Evaluated on the HotpotQA and Natural Questions benchmarks, the method achieves a 5.6% absolute improvement in Exact Match (EM) score and reduces the average number of retrieval rounds by 10.5%, substantially enhancing both reasoning accuracy and computational efficiency.
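The LLM-driven contextualization step described above can be pictured as a preprocessing pass that condenses retrieved documents into question-focused context before they enter the reasoning prompt. A minimal sketch, not the paper's actual implementation; the function names and the `llm_call` interface (a plain `prompt -> str` wrapper around a model such as GPT-4.1-mini) are assumptions:

```python
def contextualize(question, retrieved_docs, llm_call):
    """Condense retrieved documents into question-focused context.

    question: the current (sub-)question being reasoned about.
    retrieved_docs: list of document texts from the retriever.
    llm_call: assumed prompt -> str wrapper around an LLM
              (e.g. GPT-4.1-mini); interface is illustrative only.
    Returns a short string of extracted facts to splice into the
    agent's generation prompt instead of the raw documents.
    """
    docs_text = "\n\n".join(
        f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs)
    )
    prompt = (
        "Extract only the facts from the documents below that help "
        "answer the question. Be concise.\n\n"
        f"Question: {question}\n\n"
        f"Documents:\n{docs_text}"
    )
    return llm_call(prompt)
```

Feeding the condensed facts rather than full documents into the prompt is what reduces token consumption while keeping the relevant evidence available to the reasoning loop.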

📝 Abstract
Retrieval-Augmented Generation (RAG) systems face challenges with complex, multihop questions, and agentic frameworks such as Search-R1 (Jin et al., 2025), which operate iteratively, have been proposed to address these complexities. However, such approaches can introduce inefficiencies, including repetitive retrieval of previously processed information and challenges in contextualizing retrieved results effectively within the current generation prompt. Such issues can lead to unnecessary retrieval turns, suboptimal reasoning, inaccurate answers, and increased token consumption. In this paper, we investigate test-time modifications to the Search-R1 pipeline to mitigate these identified shortcomings. Specifically, we explore the integration of two components and their combination: a contextualization module to better integrate relevant information from retrieved documents into reasoning, and a de-duplication module that replaces previously retrieved documents with the next most relevant ones. We evaluate our approaches using the HotpotQA (Yang et al., 2018) and the Natural Questions (Kwiatkowski et al., 2019) datasets, reporting the exact match (EM) score, an LLM-as-a-Judge assessment of answer correctness, and the average number of turns. Our best-performing variant, utilizing GPT-4.1-mini for contextualization, achieves a 5.6% increase in EM score and reduces the number of turns by 10.5% compared to the Search-R1 baseline, demonstrating improved answer accuracy and retrieval efficiency.
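The de-duplication module described in the abstract amounts to filtering the ranked retrieval list so that documents already returned in earlier turns are replaced by the next most relevant unseen ones. A minimal sketch under that reading; the function name and data shapes are assumptions, not the paper's implementation:

```python
def dedup_retrieve(ranked_docs, seen_ids, k):
    """Return the top-k documents not retrieved in earlier turns.

    ranked_docs: list of (doc_id, text) pairs sorted by relevance,
                 as produced by the retriever for the current query.
    seen_ids: set of doc_ids already returned in previous turns
              (mutated in place to record this turn's picks).
    k: number of documents to hand to the agent this turn.
    Previously seen documents are skipped, so their slots are
    filled by the next most relevant unseen documents.
    """
    fresh = [
        (doc_id, text)
        for doc_id, text in ranked_docs
        if doc_id not in seen_ids
    ]
    selected = fresh[:k]
    seen_ids.update(doc_id for doc_id, _ in selected)
    return selected
```

Because each turn is guaranteed to surface new evidence, the agent avoids re-reading the same passages, which is consistent with the reported reduction in average retrieval turns.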
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
multi-hop questions
test-time efficiency
repetitive retrieval
contextualization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Optimization
Contextualization Module
De-duplication Module
Agentic RAG
Multi-hop QA