🤖 AI Summary
Existing classifier agreement metrics (e.g., Cohen’s kappa) lack a statistical significance assessment framework, rendering their numerical values difficult to interpret objectively. Method: We propose the first general-purpose significance evaluation framework, introducing two novel indices: (i) an empirical significance index for finite samples—built upon Monte Carlo hypothesis testing and an efficient numerical algorithm—and (ii) an asymptotic significance index for classification probability distributions—characterizing statistical meaning in the large-sample limit. Contribution/Results: Our framework yields rigorous p-values and data-driven significance thresholds for any agreement metric, eliminating subjective interpretive boundaries. Empirically validated on medical evaluation and AI model compression tasks, it demonstrates robustness and practical utility, advancing the paradigm from “empirical agreement” to “statistically reliable agreement.”
📝 Abstract
Agreement measures, such as Cohen's kappa or the intraclass correlation coefficient, quantify the matching between two or more classifiers. They are used in a wide range of contexts, from medicine, where they evaluate the effectiveness of medical treatments and clinical trials, to artificial intelligence, where they can quantify the approximation introduced by reducing a classifier. The consistency of different classifiers with a gold standard can be compared simply by using the order induced by their agreement measure with respect to the gold standard itself. Nevertheless, labelling an approach as good or bad exclusively on the basis of an agreement value requires a scale or a significance index. Some quality scales have been proposed in the literature for Cohen's kappa, but they are mainly naive and their boundaries are arbitrary. This work proposes a general approach to evaluating the significance of any agreement value between two classifiers and introduces two significance indices: one for finite data sets, the other for classification probability distributions. Moreover, this manuscript addresses the computational issues of evaluating such indices and identifies efficient algorithms for computing them.
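The finite-sample index described above rests on Monte Carlo hypothesis testing: the observed agreement is compared against agreement values obtained under a null model of independent labelings. The sketch below is not the paper's algorithm, only a generic illustration of the idea using a permutation test on Cohen's kappa; the function names and the choice of permutation null are assumptions.

```python
import numpy as np

def cohens_kappa(ratings_a, ratings_b, n_classes):
    """Cohen's kappa computed from two label sequences."""
    cm = np.zeros((n_classes, n_classes))
    for x, y in zip(ratings_a, ratings_b):
        cm[x, y] += 1
    n = cm.sum()
    p_observed = np.trace(cm) / n
    # Chance agreement from the product of marginal distributions
    p_expected = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

def mc_agreement_pvalue(ratings_a, ratings_b, n_classes,
                        n_perm=2000, seed=0):
    """Monte Carlo p-value under a null of independent labelings:
    the fraction of label permutations whose kappa meets or exceeds
    the observed value (with the standard +1 correction)."""
    rng = np.random.default_rng(seed)
    b = np.asarray(ratings_b)
    observed = cohens_kappa(ratings_a, b, n_classes)
    exceed = sum(
        cohens_kappa(ratings_a, rng.permutation(b), n_classes) >= observed
        for _ in range(n_perm)
    )
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical example: two raters agreeing on 28 of 30 items, 3 classes
rater_a = [0]*10 + [1]*10 + [2]*10
rater_b = [0]*9 + [1] + [1]*9 + [2] + [2]*10
kappa, p = mc_agreement_pvalue(rater_a, rater_b, n_classes=3)
```

Here the p-value gives a data-driven interpretation of the kappa value itself, in the spirit of the proposed framework, instead of relying on a fixed verbal scale such as "moderate" or "substantial" agreement.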