Towards Reliable Testing for Multiple Information Retrieval System Comparisons

📅 2025-01-07

🤖 AI Summary

Problem: When several information retrieval (IR) systems are compared at once, running many hypothesis tests inflates the Type I error rate and undermines the reliability of the results. Method: The paper uses a new approach, based on simulated and real TREC data, to assess how reliably multiple comparison procedures control false positives in IR settings. Results: The procedures are compared on both Type I error control and statistical power. The Wilcoxon signed-rank test combined with the Benjamini-Hochberg (BH) false discovery rate (FDR) correction performs best: for typical sample sizes its Type I error rate matches the nominal significance level, and it has the highest statistical power among the tests compared. The approach offers a reproducible way to statistically validate multi-system IR evaluations.
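
To make the inflation concrete, here is a minimal sketch of how the chance of at least one false positive grows with the number of systems. It assumes every pair of systems is tested and, as a simplification that does not hold exactly for IR runs sharing the same topics, that the tests are independent. The function names are illustrative and do not come from the paper.

```python
def num_pairwise_comparisons(n_systems: int) -> int:
    """Number of distinct system pairs when every system is compared with every other."""
    return n_systems * (n_systems - 1) // 2


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n_tests tests.

    Assumes independent tests, each run at significance level alpha.
    This is an illustrative simplification, not the paper's model.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    alpha = 0.05
    for n_systems in (2, 5, 10):
        n_tests = num_pairwise_comparisons(n_systems)
        rate = familywise_error_rate(alpha, n_tests)
        print(f"{n_systems} systems, {n_tests} tests: {rate:.3f}")
```

Under these assumptions, comparing five systems pairwise already means ten tests, and the probability of at least one spurious "significant" difference is far above the per-test level of 0.05.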

📝 Abstract

Null Hypothesis Significance Testing is the de facto tool for assessing effectiveness differences between Information Retrieval systems. Researchers use statistical tests to check whether those differences will generalise to online settings or are just due to the samples observed in the laboratory. Much work has been devoted to studying which test is the most reliable when comparing a pair of systems, but most real-world IR experiments involve more than two. In the multiple comparisons scenario, testing several systems simultaneously may inflate the errors committed by the tests. In this paper, we use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that Wilcoxon plus the Benjamini-Hochberg correction yields Type I error rates in line with the nominal significance level for typical sample sizes while being the best test in terms of statistical power.
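
As an illustration of the correction step named above, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. It assumes the p-values have already been computed, for example by one paired Wilcoxon signed-rank test per pair of systems. The function name and the example p-values are hypothetical and are not taken from the paper's experiments.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null hypothesis is rejected.

    Standard BH step-up rule: sort the m p-values, find the largest rank k
    (1-based) with p_(k) <= (k / m) * alpha, and reject the k smallest.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject every hypothesis ranked at or below largest_k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values standing in for ten pairwise system comparisons.
    example = [0.001, 0.008, 0.039, 0.041, 0.042,
               0.060, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, alpha=0.05))
```

Without any correction, five of these ten made-up p-values fall below 0.05; with the BH rule at the same level, only the two smallest are rejected. Because the rule is step-up, a p-value that fails its own threshold can still be rejected when a larger p-value passes a later threshold.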

Problem

Research questions and friction points this paper is trying to address.

Information Retrieval

Hypothesis Testing

Statistical Significance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Wilcoxon Signed-Rank Test

Benjamini-Hochberg Procedure

Error Rate Control
