🤖 AI Summary
This work investigates how to efficiently select the best query variant from multiple semantically equivalent reformulations of the same information need in a retrieval-augmented generation (RAG) pipeline, with the goal of improving end-to-end generation quality. It applies query performance prediction (QPP) to intra-topic query variant selection and systematically compares pre-retrieval and post-retrieval QPP approaches. Large-scale evaluations on the TREC-RAG dataset, using both sparse and dense retrievers, reveal a “utility gap” between retrieval metrics and downstream generation quality. The study shows that QPP can identify variants that outperform the original query, and that lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a lower-latency route to variant selection.
📝 Abstract
Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination: selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
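To make the idea of pre-retrieval variant selection concrete, here is a minimal sketch in Python. It is illustrative only: the abstract does not say which predictors the authors use, so the sketch assumes average IDF, a classic pre-retrieval QPP predictor, and the function names (`avg_idf`, `select_variant`), the toy document frequencies, and the example queries are all hypothetical.

```python
import math


def avg_idf(query, doc_freq, num_docs):
    """Average inverse document frequency of the query terms.

    Assumed predictor for illustration; it needs only collection
    statistics, so no retrieval has to be run to score a variant.
    """
    terms = query.lower().split()
    if not terms:
        return 0.0
    # Add-one smoothing keeps unseen terms from dividing by zero.
    idfs = [math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1)) for t in terms]
    return sum(idfs) / len(idfs)


def select_variant(variants, doc_freq, num_docs):
    """Return the variant with the highest predicted performance."""
    return max(variants, key=lambda q: avg_idf(q, doc_freq, num_docs))


# Toy collection statistics (hypothetical numbers, not from the paper).
doc_freq = {
    "what": 900, "is": 950, "the": 990, "of": 980, "on": 940,
    "how": 800, "does": 700, "effect": 120, "affect": 100,
    "caffeine": 15, "sleep": 40,
}
num_docs = 1000

# Three reformulations of the same information need.
variants = [
    "what is the effect of caffeine on sleep",
    "how does caffeine affect sleep",
    "caffeine sleep effect",
]

best = select_variant(variants, doc_freq, num_docs)
print(best)
```

Only the selected variant would then be passed to retrieval and generation. Average IDF tends to favor short, keyword-heavy variants, which may or may not align with generation quality; that mismatch is the kind of divergence the abstract calls the "utility gap".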