Beyond Correlations: A Downstream Evaluation Framework for Query Performance Prediction

📅 2026-01-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of traditional query performance prediction (QPP) evaluation, which relies on a set-level correlation between estimated and true retrieval quality and therefore neither reflects per-query effectiveness nor guides downstream retrieval decisions. The authors propose a downstream-oriented QPP evaluation framework that uses QPP estimates directly as weights in result fusion. The distribution of QPP estimates across the top-document lists retrieved by several rankers serves as a prior for weighted CombSUM and reciprocal rank fusion (RRF). Experiments show that this QPP-weighted fusion improves effectiveness by over 4.5% compared to the unweighted CombSUM and RRF baselines, which supports the practical utility of QPP in IR pipelines. The study also finds that downstream effectiveness does not correlate well with the standard correlation-based QPP evaluation, underscoring the need for task-aware QPP evaluation.
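To make the idea of QPP-weighted fusion concrete, here is a minimal Python sketch, not the authors' implementation. It assumes that the per-ranker QPP estimates for a query are normalized to sum to 1 and then used as weights, that CombSUM operates on min-max normalized scores, and that RRF uses the commonly chosen constant k = 60. The function names, ranker names, and toy scores are all hypothetical.

```python
from collections import defaultdict


def normalize_weights(qpp_scores):
    """Turn per-ranker QPP estimates for one query into weights summing to 1.

    Assumption: simple sum-normalization; the paper may build its prior differently.
    """
    total = sum(qpp_scores.values())
    if total <= 0:
        # No usable signal: fall back to uniform (unweighted) fusion.
        return {ranker: 1.0 / len(qpp_scores) for ranker in qpp_scores}
    return {ranker: score / total for ranker, score in qpp_scores.items()}


def min_max(scores):
    """Rescale one ranker's retrieval scores to [0, 1] so rankers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_combsum(runs, weights):
    """Sum each document's normalized scores, scaled by the ranker's weight."""
    fused = defaultdict(float)
    for ranker, scores in runs.items():
        for doc, s in min_max(scores).items():
            fused[doc] += weights[ranker] * s
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


def weighted_rrf(runs, weights, k=60):
    """Sum weight / (k + rank) over the rankers that retrieved each document."""
    fused = defaultdict(float)
    for ranker, scores in runs.items():
        ranking = sorted(scores, key=lambda d: (-scores[d], d))
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += weights[ranker] / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy data for a single query (hypothetical rankers, documents, and scores).
runs = {
    "bm25": {"d1": 12.0, "d2": 9.0, "d3": 3.0},
    "dense": {"d2": 0.9, "d4": 0.8, "d1": 0.1},
}
qpp_estimates = {"bm25": 0.2, "dense": 0.6}  # predictor favours the dense ranker

weights = normalize_weights(qpp_estimates)
combsum_ranking = weighted_combsum(runs, weights)
rrf_ranking = weighted_rrf(runs, weights)
```

In this toy case the predictor trusts the dense ranker more, so documents it ranks highly move up in both fused lists. Setting all weights equal recovers the unweighted baselines.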

📝 Abstract

The standard practice of query performance prediction (QPP) evaluation is to measure a set-level correlation between the estimated retrieval qualities and the true ones. However, this correlation-based evaluation neither quantifies QPP effectiveness at the level of individual queries nor connects to a downstream application, meaning that QPP methods yielding high correlation values may not find a practical application in query-specific decisions in an IR pipeline. In this paper, we propose a downstream-focussed evaluation framework where a distribution of QPP estimates across lists of top documents retrieved with several rankers is used as priors for IR fusion. On the one hand, a distribution of these estimates that closely matches that of the true retrieval qualities indicates the quality of the predictor; on the other hand, their use as priors indicates a predictor's ability to make informed choices in an IR pipeline. Our experiments first establish the importance of QPP estimates in weighted IR fusion, yielding substantial improvements of over 4.5% over unweighted CombSUM and RRF fusion strategies, and second reveal that the downstream effectiveness of QPP does not correlate well with the standard correlation-based QPP evaluation.
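The set-level correlation that the abstract describes as standard practice can be illustrated with a short sketch. The abstract does not name a specific coefficient; Kendall's tau (the tau-a variant, without tie correction) is assumed here purely for illustration, and the query scores are made up.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's tau-a between QPP estimates and true per-query effectiveness.

    Each pair of queries counts as concordant if both lists order the pair
    the same way, discordant if they disagree; tied pairs contribute nothing.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for four queries: QPP estimates and true average precision.
qpp_estimates = [0.9, 0.4, 0.7, 0.1]
true_ap = [0.8, 0.5, 0.3, 0.2]

tau = kendall_tau(qpp_estimates, true_ap)
```

The result is a single number for the whole query set. It says nothing about which individual queries were mispredicted or how a misprediction would affect a decision in the pipeline, which is the gap the proposed downstream evaluation targets.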

Problem

Research questions and friction points this paper is trying to address.

query performance prediction

evaluation framework

downstream application

information retrieval

correlation-based evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

query performance prediction

downstream evaluation

IR fusion