🤖 AI Summary
To address factual inaccuracies in large language models (LLMs) stemming from overreliance on parametric knowledge and the susceptibility of retrieval-augmented generation (RAG) to irrelevant retrieved documents, this paper proposes a training-free, test-time inference framework. The method retrieves lightweight external knowledge, generates multiple candidate responses, and performs efficient majority voting based on semantic similarity among response prefixes—enabling robust filtering of incorrect answers. Its key innovations are: (i) consensus determination using only partial response prefixes, drastically reducing computational overhead; and (ii) synergistic integration of RAG’s knowledge grounding with ensemble methods’ robustness. Evaluated on open-domain question answering, recipe generation, and image captioning, the approach significantly improves accuracy while maintaining low latency, demonstrating both efficiency and strong generalization across diverse tasks.
📝 Abstract
Although Large Language Models (LLMs) demonstrate significant capabilities, their reliance on parametric knowledge often leads to inaccuracies. Retrieval Augmented Generation (RAG) mitigates this by incorporating external knowledge, but it may retrieve irrelevant documents, leading to inaccurate responses. Ensemble-based integration methods can filter out incorrect answers from multiple responses, but they lack the external knowledge that RAG provides, and their high cost requires balancing overhead against performance gains. To address these issues, we propose an Efficient Test-Time Retrieval-Augmented Generation Framework named ET2RAG to improve the performance of LLMs while maintaining efficiency. Specifically, ET2RAG is a training-free method that first retrieves the most relevant documents and augments the LLMs to efficiently generate diverse candidate responses by managing response length. Then we compute the similarity among candidate responses and employ a majority voting mechanism to select the most suitable response as the final output. In particular, we discover that partial generation is sufficient to capture the key information necessary for consensus calculation, allowing us to perform majority voting effectively without fully generated responses. Thus, we can balance computational cost and performance by adjusting the response length and the number of retrieved documents used for majority voting. Experimental results demonstrate that ET2RAG significantly enhances performance across three tasks: open-domain question answering, recipe generation and image captioning.
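The prefix-based consensus step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not specify the similarity measure, so token-level Jaccard similarity over the first `prefix_len` tokens is assumed here as a stand-in, and `prefix_majority_vote` is a hypothetical helper name.

```python
def prefix_majority_vote(candidates, prefix_len=30):
    """Select a consensus response from candidate generations using only
    their prefixes, so candidates need not be fully generated."""
    # Truncate each candidate to its first `prefix_len` tokens (partial generation).
    prefixes = [c.split()[:prefix_len] for c in candidates]

    def jaccard(a, b):
        # Token-set overlap as a cheap stand-in for semantic similarity.
        sa, sb = set(a), set(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    # Score each candidate by its total similarity to the other prefixes;
    # the highest-scoring candidate is treated as the majority-vote winner.
    scores = [sum(jaccard(p, q) for j, q in enumerate(prefixes) if j != i)
              for i, p in enumerate(prefixes)]
    return candidates[max(range(len(candidates)), key=lambda i: scores[i])]


# Example: two candidates agree on "Paris", one outlier says "London";
# the voting step filters out the outlier.
answers = [
    "Paris is the capital of France.",
    "Paris is the capital city of France.",
    "London is the capital of France.",
]
print(prefix_majority_vote(answers))
```

In a real pipeline the candidates would come from an LLM conditioned on different retrieved documents, and an embedding-based similarity could replace the Jaccard stand-in; the voting logic stays the same.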