PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations in evaluating search and retrieval-augmented generation (RAG) systems: traditional evaluation relies heavily on costly human annotations, while large language model (LLM)-based automatic evaluators carry inherent biases and struggle to reliably estimate metrics that require fine-grained sub-instance labels. The paper proposes the first statistical framework extending Prediction-Powered Inference (PPI) to the query-document annotation setting. By reformulating the metric-integration space, the approach reduces computational complexity from O(2^|C|) to O(2^K), substantially improving scalability. With only 100 human-annotated samples and 10,000 unlabeled instances, the method effectively corrects LLM-induced bias and significantly reduces estimation variance for key metrics such as Precision@K across standard retrieval benchmarks.

📝 Abstract
Evaluating the quality of search, ranking, and RAG systems traditionally requires a significant number of human relevance annotations. Recently, several deployed systems have explored using Large Language Models (LLMs) as automated judges for this task, but their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics that require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift in an LLM-based query-reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduce the computational complexity from O(2^|C|) to O(2^K), where |C| represents the corpus size (on the order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric while effectively correcting for LLM bias in low-resource settings.
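The core mechanism behind PPI-style debiasing is simple: average the LLM-judged metric over the large unlabeled pool, then add a "rectifier" term, the mean human-minus-LLM difference measured on the small labeled subset. A minimal sketch on simulated per-query Precision@K values; all names, sample sizes, and bias magnitudes below are illustrative assumptions, not the paper's PRECISE formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-query Precision@K values:
#   llm_all   - LLM-judged metric on a large unlabeled pool (N queries)
#   human     - human-annotated metric on a small labeled subset (n queries)
#   llm_small - LLM-judged metric on that same labeled subset
N, n = 10_000, 100          # pool sizes mirroring the abstract's setting
true_mean = 0.6             # assumed true mean Precision@K
bias = 0.1                  # assumed systematic LLM over-estimation

llm_all = np.clip(rng.normal(true_mean + bias, 0.15, N), 0, 1)
human = np.clip(rng.normal(true_mean, 0.15, n), 0, 1)
llm_small = np.clip(human + rng.normal(bias, 0.05, n), 0, 1)

# Naive LLM-only estimate: inherits the LLM's bias.
naive = llm_all.mean()

# PPI estimate: LLM mean over the pool, corrected by the rectifier
# estimated from the 100 human-labeled queries.
rectifier = (human - llm_small).mean()
ppi = llm_all.mean() + rectifier

print(f"naive LLM estimate: {naive:.3f}")
print(f"PPI estimate:       {ppi:.3f}")
```

The rectifier needs only the small human-labeled subset, while the low-variance pool average comes from the cheap LLM judgments, which is why so few human annotations suffice.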
Problem

Research questions and friction points this paper is trying to address.

LLM bias
evaluation metrics
relevance estimation
human annotation
ranking evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prediction-Powered Inference
LLM bias correction
sub-instance annotation
metric estimation
low-resource evaluation