🤖 AI Summary
In multi-system information retrieval (IR) evaluation, multiple hypothesis testing inflates Type I error rates, undermining result reliability.
Method: This paper proposes a framework that uses simulated and real TREC data to systematically assess how reliably multiple-comparison correction procedures control false positives in IR settings.
Contribution/Results: It empirically compares multiple-comparison correction techniques, assessing both Type I error control and statistical power, under realistic IR conditions. Results show that the Wilcoxon signed-rank test combined with the Benjamini–Hochberg (BH) false discovery rate (FDR) correction performs best: it keeps Type I error rates in line with the nominal significance level for typical sample sizes while delivering the highest statistical power among the tested procedures. The framework offers a reproducible approach to statistical validation in multi-system IR evaluation.
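To make the inflation mentioned in the summary concrete, the sketch below computes the probability of at least one false positive when every pair of systems is tested at the same significance level. It assumes that all null hypotheses are true and that the tests are independent. Pairwise tests over a shared topic set are not independent in practice, so this is only a rough illustration; the function name `familywise_error_rate` is hypothetical and does not come from the paper.

```python
from math import comb


def familywise_error_rate(num_systems: int, alpha: float = 0.05) -> float:
    """Probability of at least one Type I error across all pairwise tests.

    Assumption (not from the paper): the tests are independent and every
    null hypothesis is true, so each test errs with probability alpha.
    """
    num_comparisons = comb(num_systems, 2)  # k systems -> k*(k-1)/2 pairs
    return 1.0 - (1.0 - alpha) ** num_comparisons


if __name__ == "__main__":
    for k in (2, 5, 10):
        print(k, comb(k, 2), round(familywise_error_rate(k), 3))
```

Under these assumptions, 5 systems already require 10 pairwise tests, and the chance of at least one spurious "significant" difference grows quickly with the number of systems.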
📝 Abstract
Null Hypothesis Significance Testing is the de facto tool for assessing effectiveness differences between Information Retrieval systems. Researchers use statistical tests to check whether those differences will generalise to online settings or are just due to the samples observed in the laboratory. Much work has been devoted to studying which test is the most reliable when comparing a pair of systems, but most real-world IR experiments involve more than two. In the multiple comparisons scenario, testing several systems simultaneously may inflate the errors committed by the tests. In this paper, we use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that Wilcoxon plus the Benjamini-Hochberg correction yields Type I error rates in line with the significance level for typical sample sizes while being the best test in terms of statistical power.
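As a rough illustration of the procedure the abstract names, the following sketch runs a Wilcoxon signed-rank test on every pair of systems and then applies the Benjamini-Hochberg step-up rule to the resulting p-values. It is not the authors' implementation: the function names and the `scores` layout (system name mapped to per-topic scores in a shared topic order) are assumptions made for this example. The Wilcoxon test comes from SciPy, which is imported lazily so that the correction step itself depends only on the standard library.

```python
from itertools import combinations


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where BH rejects the null.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= (k / m) * alpha, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def pairwise_wilcoxon_bh(scores, alpha=0.05):
    """Test all system pairs and correct with BH.

    scores: dict mapping a system name to its per-topic scores, with all
    lists in the same topic order (hypothetical layout for this sketch).
    Returns {(system_a, system_b): (p_value, rejected)}.
    """
    from scipy.stats import wilcoxon  # third-party dependency

    pairs = list(combinations(scores, 2))
    p_values = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
    decisions = benjamini_hochberg(p_values, alpha)
    return {pair: (p, rej) for pair, p, rej in zip(pairs, p_values, decisions)}
```

Calling `pairwise_wilcoxon_bh` with, for example, per-topic average precision for each run gives one corrected decision per system pair. Note that SciPy's `wilcoxon` may raise an error when two systems have identical scores on every topic, so such pairs would need separate handling.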