🤖 AI Summary
This work addresses the limitations of traditional query performance prediction (QPP) evaluation, which measures a set-level correlation between estimated and true retrieval quality and therefore neither reflects per-query effectiveness nor guides downstream retrieval decisions. The authors propose a downstream-oriented QPP evaluation framework that uses QPP estimates directly for weighted result fusion. The distribution of QPP estimates across the top-document lists of multiple rankers serves as a fusion prior, which weights the CombSUM and reciprocal rank fusion (RRF) strategies. Experiments show that QPP-weighted fusion improves effectiveness by over 4.5% compared to unweighted CombSUM and RRF, indicating that QPP estimates can be practically useful in an information retrieval pipeline. The study also finds that the downstream effectiveness of a QPP method does not correlate well with its score under standard correlation-based evaluation, underscoring the need for task-aware QPP evaluation.
📝 Abstract
The standard practice in query performance prediction (QPP) evaluation is to measure a set-level correlation between the estimated retrieval qualities and the true ones. However, this correlation-based measure neither quantifies QPP effectiveness at the level of individual queries nor connects to a downstream application, meaning that QPP methods yielding high correlation values may not be practically useful for query-specific decisions in an IR pipeline. In this paper, we propose a downstream-focussed evaluation framework in which the distribution of QPP estimates across the lists of top documents retrieved by several rankers is used as a prior for IR fusion. On the one hand, a distribution of these estimates that closely matches that of the true retrieval qualities indicates the quality of the predictor; on the other hand, their usage as priors indicates the predictor's ability to make informed choices in an IR pipeline. Our experiments first establish the importance of QPP estimates in weighted IR fusion, yielding substantial improvements of over 4.5% over unweighted CombSUM and RRF fusion strategies, and second, reveal that the downstream effectiveness of QPP does not correlate well with the standard correlation-based QPP evaluation.
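To make the idea of QPP estimates as fusion priors concrete, the following is a minimal sketch of QPP-weighted CombSUM and RRF for a single query. It is not the authors' implementation: the abstract does not specify how the estimates are turned into weights, so the sum-to-one normalization, the min-max score normalization, the RRF constant `k=60`, and all function and variable names here are illustrative assumptions.

```python
from collections import defaultdict


def normalize_priors(qpp_scores):
    """Turn per-ranker QPP estimates for one query into weights summing to 1.

    Assumption: a simple sum normalization; the paper may use another scheme.
    """
    total = sum(qpp_scores.values())
    if total <= 0:
        # Fall back to uniform weights, which reduces to unweighted fusion.
        return {r: 1.0 / len(qpp_scores) for r in qpp_scores}
    return {r: s / total for r, s in qpp_scores.items()}


def min_max(scores):
    """Rescale one ranker's retrieval scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def weighted_combsum(runs, weights):
    """Sum the normalized scores of each document, weighted per ranker."""
    fused = defaultdict(float)
    for ranker, scores in runs.items():
        for doc, s in min_max(scores).items():
            fused[doc] += weights[ranker] * s
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


def weighted_rrf(runs, weights, k=60):
    """Sum reciprocal ranks 1 / (k + rank), weighted per ranker."""
    fused = defaultdict(float)
    for ranker, scores in runs.items():
        ranking = sorted(scores, key=lambda d: (-scores[d], d))
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += weights[ranker] / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy data: two hypothetical rankers and their QPP estimates for one query.
runs = {
    "bm25": {"d1": 12.0, "d2": 9.0, "d3": 3.0},
    "dense": {"d3": 0.9, "d2": 0.8, "d4": 0.1},
}
weights = normalize_priors({"bm25": 0.25, "dense": 0.75})

print(weighted_combsum(runs, weights))  # ['d2', 'd3', 'd1', 'd4']
print(weighted_rrf(runs, weights))      # ['d3', 'd2', 'd4', 'd1']
```

With uniform weights both functions reduce to the unweighted baselines. In this toy case the higher prior on the `dense` ranker moves its top document `d3` ahead of `d1` in CombSUM, which is the kind of query-specific decision the framework is meant to evaluate.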