🤖 AI Summary
In open-domain question answering, external evidence retrieval suffers from inflexible top-$k$ selection: a fixed $k$ either wastes tokens or omits critical evidence, and existing adaptive methods underperform on aggregation QA. This paper proposes Adaptive-$k$, a single-pass adaptive retrieval method that requires no fine-tuning, no iterative prompting, and no extra LLM calls: it dynamically determines the number of passages to retrieve from the statistical distribution of query-passage similarity scores. By adaptively thresholding this distribution, Adaptive-$k$ retrieves about 70% of relevant passages while using up to 10× fewer tokens than full-context input. It is agnostic to both the retriever and the LLM, and integrates seamlessly into RAG and long-context LLM (LCLM) pipelines. Evaluation on both factoid and aggregation QA benchmarks shows that Adaptive-$k$ matches or outperforms fixed-$k$ baselines, improving accuracy and the efficiency–effectiveness trade-off across five long-context LLMs and two embedding models.
📝 Abstract
Retrieval-augmented generation (RAG) and long-context language models (LCLMs) both address the context limitations of LLMs in open-domain question answering (QA). However, how much external context to retrieve remains an open problem: fixing the retrieval size risks either wasting tokens or omitting key evidence. Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM prompting and perform well on factoid QA, but struggle with aggregation QA, where the optimal context size is both unknown and variable. We present Adaptive-$k$ retrieval, a simple and effective single-pass method that adaptively selects the number of passages based on the distribution of the similarity scores between the query and the candidate passages. It requires no model fine-tuning, no extra LLM inference, and no changes to existing retriever-reader pipelines. On both factoid and aggregation QA benchmarks, Adaptive-$k$ matches or outperforms fixed-$k$ baselines while using up to 10× fewer tokens than full-context input, yet still retrieves 70% of relevant passages. It improves accuracy across five LCLMs and two embedding models, highlighting that dynamically adjusting context size leads to more efficient and accurate QA.
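To make the core idea concrete, here is a minimal sketch of score-distribution-based adaptive selection. This is an illustrative heuristic, not the paper's exact thresholding rule: it sorts query-passage similarity scores in descending order and cuts where the drop between consecutive scores is largest. The function name `adaptive_k` and the `min_k`/`max_k` bounds are assumptions for this sketch.

```python
import numpy as np

def adaptive_k(scores, min_k=1, max_k=None):
    """Pick the number of passages to keep from the similarity distribution.

    Illustrative heuristic (not necessarily the paper's criterion):
    sort scores descending and cut at the largest gap between
    consecutive scores, constrained to [min_k, max_k].
    """
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending
    if max_k is None:
        max_k = len(s)
    gaps = s[:-1] - s[1:]  # drop between each score and the next
    # Only consider cut points that yield between min_k and max_k passages.
    cut = int(np.argmax(gaps[min_k - 1:max_k - 1])) + min_k
    return cut

# Hypothetical retriever scores: three clearly relevant passages,
# then a sharp drop in similarity.
scores = [0.91, 0.88, 0.86, 0.52, 0.50, 0.31, 0.30]
k = adaptive_k(scores)                      # largest drop after the 3rd score
top_idx = np.argsort(scores)[::-1][:k]      # indices of the passages to keep
```

A fixed-$k$ retriever with $k=5$ would pad the context with the two weak matches here, while $k=2$ would drop a relevant passage; the gap-based cut adapts to however many passages the score distribution supports, which is what lets the method handle aggregation queries whose evidence set size varies.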