Predicting Retrieval Utility and Answer Quality in Retrieval-Augmented Generation

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately predicting the utility of retrieved documents for final answer quality in retrieval-augmented generation (RAG). It introduces two novel prediction tasks—Retrieval Performance Prediction (RPP) and Generation Performance Prediction (GPP)—thereby extending query performance prediction into the RAG framework for the first time. The authors propose a linear regression model that jointly leverages three categories of features: retriever-centric (e.g., query-document relevance), reader-centric (e.g., LLM conditional perplexity), and intrinsic document quality (e.g., readability). Experimental results on the Natural Questions dataset demonstrate that this multi-feature fusion strategy significantly improves the accuracy of both RPP and GPP, offering an effective utility evaluation mechanism for RAG systems.
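One of the reader-centric features described above is the LLM's conditional perplexity of the retrieved context given the query. As a minimal sketch, the paper-level idea reduces to a simple formula once per-token log-probabilities are available from a causal LM; the function name and the hard-coded log-probabilities below are illustrative assumptions, not part of the paper's code.

```python
import math

def conditional_perplexity(token_logprobs):
    """Perplexity of a token sequence from its per-token natural-log
    probabilities, e.g. an LLM scoring the retrieved document conditioned
    on the query. Lower values mean the reader model finds the context
    more predictable -- a reader-centric signal for RPP/GPP.
    (Illustrative helper; the log-probs would come from a real LM.)"""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # perplexity = exp(-(1/N) * sum of log-probs)
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probs for a three-token context:
ppl = conditional_perplexity([-1.0, -2.0, -3.0])  # exp(2.0) ≈ 7.39
```

In a real pipeline the list would be obtained by scoring the concatenation of query and retrieved document with the reader LLM and keeping only the document tokens' log-probabilities.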

📝 Abstract
The quality of answers generated by large language models (LLMs) in retrieval-augmented generation (RAG) is largely influenced by the contextual information contained in the retrieved documents. A key challenge for improving RAG is to predict both the utility of retrieved documents -- quantified as the performance gain from using context over generation without context -- and the quality of the final answers in terms of correctness and relevance. In this paper, we define two prediction tasks within RAG. The first is retrieval performance prediction (RPP), which estimates the utility of retrieved documents. The second is generation performance prediction (GPP), which estimates the final answer quality. We hypothesise that in RAG, the topical relevance of retrieved documents correlates with their utility, suggesting that query performance prediction (QPP) approaches can be adapted for RPP and GPP. Beyond these retriever-centric signals, we argue that reader-centric features, such as the LLM's perplexity of the retrieved context conditioned on the input query, can further enhance prediction accuracy for both RPP and GPP. Finally, we propose that features reflecting query-agnostic document quality and readability can also provide useful signals to the predictions. We train linear regression models with the above categories of predictors for both RPP and GPP. Experiments on the Natural Questions (NQ) dataset show that combining predictors from multiple feature categories yields the most accurate estimates of RAG performance.
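The abstract's core modelling idea is a linear regression that fuses retriever-centric, reader-centric, and document-quality predictors into a single RPP or GPP estimate. A minimal self-contained sketch of that fusion is below; the feature values and target weights are synthetic, and the paper does not specify this exact solver -- ordinary least squares via the normal equations is just one standard way to fit it.

```python
def fit_ols(X, y):
    """Fit ordinary least squares by solving (X^T X) w = X^T y with
    Gaussian elimination. Each row of X already includes a bias term."""
    n, d = len(X), len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(d)]
         for i in range(d)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(d)]
    for col in range(d):                      # forward elimination
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * d                             # back substitution
    for i in range(d - 1, -1, -1):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w

# Synthetic rows: (retriever relevance, context perplexity, readability).
# The targets follow a known linear rule so the fit is easy to check.
features = [(0.9, 5.0, 0.7), (0.4, 12.0, 0.5), (0.7, 8.0, 0.9),
            (0.2, 20.0, 0.3), (0.6, 10.0, 0.6)]
X = [[1.0, rel, ppl, read] for rel, ppl, read in features]
y = [0.1 + 0.5 * rel - 0.02 * ppl + 0.3 * read
     for rel, ppl, read in features]

weights = fit_ols(X, y)            # recovers [0.1, 0.5, -0.02, 0.3]
predicted = sum(w * f for w, f in zip(weights, [1.0, 0.8, 6.0, 0.8]))
```

The learned weights indicate how each feature category contributes to the predicted utility or answer quality; in the paper's setting, separate regressors would be trained for RPP and GPP targets on NQ.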
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
Retrieval Utility Prediction
Answer Quality Prediction
Query Performance Prediction
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Performance Prediction
Query Performance Prediction
Perplexity-based Features
Document Quality