🤖 AI Summary
This work addresses two key challenges in retrieval-augmented generation (RAG): (1) the sensitivity of large language model (LLM) question-answering performance to the ranking quality of retrieved documents, and (2) the reliance of existing evaluation methods on ground-truth answers. To overcome these limitations, we propose an unsupervised evaluation and optimization framework that requires no prior knowledge of the answer. Our core contribution is the first application of pointwise mutual information (PMI) to RAG, where PMI quantifies the semantic consistency between retrieved documents and generated answers. Leveraging this metric, we design a PMI-driven document re-ranking strategy and a dynamic prompt generation mechanism. Experiments on two standard QA benchmarks demonstrate a statistically significant positive correlation between PMI scores and answer accuracy (p < 0.01). Our method consistently improves QA accuracy across multiple LLMs, yielding an average gain of 5.2%, which validates its generality and effectiveness.
📝 Abstract
Recent work suggests that large language models enhanced with retrieval-augmented generation are easily influenced by the order in which retrieved documents are presented to the model when solving tasks such as question answering (QA). However, no method to date exploits this phenomenon to improve generation. We fill this gap. In this study, we show that the pointwise mutual information between a context and a question is an effective gauge of language model performance. Importantly, this gauge does not depend on knowing the answer to the question a priori. Through experiments on two question-answering datasets and a variety of large language models, we find evidence of an empirical correlation between answer accuracy and pointwise mutual information. Additionally, we propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance, and we demonstrate the effectiveness of both methods empirically.
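The core idea can be sketched in a few lines: PMI between a document d and a question q can be estimated as log p(q | d) − log p(q), i.e., how much conditioning on the document raises the model's likelihood of the question, and documents can then be re-ranked by this score. The sketch below is illustrative only; the `toy_logprob` scorer is a hypothetical stand-in for real LLM log-likelihoods, and the function names are our own, not from the paper.

```python
def pmi(question: str, document: str, cond_logprob) -> float:
    """PMI(d; q) estimated as log p(q | d) - log p(q).

    cond_logprob(text, context) should return the model's
    log-likelihood of `text` given `context` (empty context
    gives the unconditional log-likelihood).
    """
    return cond_logprob(question, document) - cond_logprob(question, "")


def rerank(question: str, documents: list[str], cond_logprob) -> list[str]:
    """Order retrieved documents by descending PMI with the question."""
    return sorted(documents, key=lambda d: pmi(question, d, cond_logprob),
                  reverse=True)


# Hypothetical scorer: a crude word-overlap proxy standing in for an
# actual LLM's conditional log-likelihood, just to make the sketch runnable.
def toy_logprob(question: str, context: str) -> float:
    q = set(question.lower().split())
    c = set(context.lower().split())
    return -float(len(q)) + len(q & c)  # more overlap -> higher "likelihood"


docs = ["Paris is the capital of France.", "Bananas are yellow."]
ranked = rerank("What is the capital of France?", docs, toy_logprob)
```

In a real setting, `cond_logprob` would sum token log-probabilities of the question under the LLM with and without the document prepended; the toy scorer above merely preserves the shape of that computation.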