🤖 AI Summary
This work investigates whether fusing query performance prediction (QPP) methods improves prediction quality, and examines whether prior findings reproduce with new models, evaluation metrics, and datasets. Method: We systematically integrate supervised neural QPP models (previously excluded from fusion frameworks) into a unified fusion pipeline and evaluate both pre- and post-retrieval approaches on modern benchmarks, including ClueWeb09B and TREC Deep Learning. Performance is assessed via sMARE alongside conventional metrics (e.g., Pearson, Spearman). We further propose a fine-grained complementarity criterion grounded in inter-method correlation to quantify information overlap. Contribution/Results: Most classical fusion conclusions remain robust; sMARE is more sensitive in distinguishing effective from ineffective fusions; and highly correlated method combinations often degrade performance due to redundancy. This study establishes a new paradigm for QPP fusion, introduces a theoretically grounded complementarity criterion, and provides an updated empirical benchmark for future research.
📝 Abstract
A large number of approaches to Query Performance Prediction (QPP) have been proposed over the last two decades. As early as 2009, Hauff et al. [28] explored whether different QPP methods may be combined to improve prediction quality. Since then, significant research has been done both on QPP approaches and on their evaluation. This study revisits Hauff et al.'s work to assess the reproducibility of their findings in the light of new prediction methods, evaluation metrics, and datasets. We expand the scope of the earlier investigation by: (i) considering post-retrieval methods, including supervised neural techniques (only pre-retrieval techniques were studied in [28]); (ii) using sMARE for evaluation, in addition to the traditional correlation coefficients and RMSE; and (iii) experimenting with additional datasets (ClueWeb09B and TREC DL). Our results largely support previous claims, but we also present several interesting findings. We interpret these findings by taking a more nuanced look at the correlation between QPP methods, examining whether they capture diverse information or rely on overlapping factors.
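To make the evaluation setup more concrete, the following is a minimal sketch of sMARE and of an inter-predictor correlation check. It assumes the common definition of sMARE as the scaled mean absolute rank error: each query is ranked once by its predicted score and once by its actual retrieval effectiveness, and the absolute rank difference is divided by the number of queries and then averaged. The function names (`ranks`, `smare`, `pearson`) and all numbers are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import math


def ranks(values):
    """1-based ranks with the highest value ranked first; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def smare(predicted, actual):
    """Scaled mean absolute rank error (assumed definition); lower is better."""
    n = len(actual)
    rank_pred, rank_act = ranks(predicted), ranks(actual)
    return sum(abs(p - a) for p, a in zip(rank_pred, rank_act)) / (n * n)


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-query effectiveness (e.g. average precision) for three queries.
actual_effectiveness = [0.9, 0.5, 0.1]
# Hypothetical outputs of two QPP methods on the same three queries.
predictor_a = [3.0, 2.0, 1.0]  # orders the queries exactly as the actual scores do
predictor_b = [1.0, 2.0, 3.0]  # orders the queries in reverse

smare_a = smare(predictor_a, actual_effectiveness)
smare_b = smare(predictor_b, actual_effectiveness)
# Correlation between the two predictors themselves, as a rough overlap signal.
overlap = pearson(predictor_a, predictor_b)
```

Unlike a single correlation coefficient computed over the whole query set, sMARE is an average of per-query errors, so it can be broken down query by query. The `overlap` value illustrates the kind of predictor-to-predictor correlation the abstract refers to when asking whether methods capture diverse information or rely on overlapping factors.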