🤖 AI Summary
This work addresses a pervasive logical misalignment between training and testing in how classifier calibration is evaluated, introducing the “Fit-on-the-Test” perspective. It shows that standard metrics such as Expected Calibration Error (ECE) implicitly refit the calibration map on the test set, which induces optimistic bias and undermines the reliability of reported results. Through theoretical analysis, a decomposition of calibration error, Monte Carlo simulations, and empirical evaluation across multiple benchmarks (CIFAR-10/100 and ImageNet subsets), the authors demonstrate, for the first time, substantial performance degradation of mainstream calibration methods once this implicit refitting is taken into account. Building on these findings, they propose a more rigorous calibration evaluation framework and a revised protocol that explicitly prevents implicit test-set refitting. The approach improves assessment reliability, statistical unbiasedness, and cross-method comparability, enabling fairer and more trustworthy calibration evaluation.
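To make the implicit refit concrete, below is a minimal Python sketch of standard binned ECE: the bin-wise accuracies it compares confidences against are estimated from the test set itself, which is exactly a histogram-binning calibration map fit on the test data. A second function sketches one simple way the refit can be avoided, by estimating the map on a disjoint split. The function names, the 15-bin default, and the disjoint-split variant are illustrative assumptions, not the paper's exact protocol or code.

```python
import numpy as np

def binned_ece(conf, correct, n_bins=15):
    """Standard equal-width binned ECE.

    The per-bin mean accuracies are estimated from the same data being
    evaluated, so they constitute a histogram-binning calibration map
    implicitly (re)fit on the test set: the "fit-on-the-test" view.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence to a bin 0..n_bins-1 via the interior edges.
    bin_ids = np.digitize(conf, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # |mean confidence - empirical accuracy| in the bin,
        # weighted by the fraction of points falling in the bin.
        ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

def ece_disjoint_fit(conf_fit, correct_fit, conf_test, correct_test, n_bins=15):
    """Hypothetical variant sketching the corrective idea: the bin-wise
    calibration targets are fit on a disjoint split, and only the gap
    is measured on the test set. An illustration, not the paper's
    exact revised protocol."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    fit_ids = np.digitize(np.asarray(conf_fit, dtype=float), edges[1:-1], right=True)
    test_ids = np.digitize(np.asarray(conf_test, dtype=float), edges[1:-1], right=True)
    correct_fit = np.asarray(correct_fit, dtype=float)
    conf_test = np.asarray(conf_test, dtype=float)
    ece = 0.0
    for b in range(n_bins):
        fit_mask, test_mask = fit_ids == b, test_ids == b
        if not fit_mask.any() or not test_mask.any():
            continue
        # Calibration target estimated on the disjoint split, not on test.
        target = correct_fit[fit_mask].mean()
        ece += test_mask.mean() * abs(conf_test[test_mask].mean() - target)
    return ece

# Illustrative usage with synthetic, perfectly calibrated scores.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 5000)
correct = (rng.uniform(size=5000) < conf).astype(float)
print(binned_ece(conf, correct))
```

Even for the perfectly calibrated synthetic scores above, `binned_ece` returns a small positive value, since the per-bin accuracy estimates it refits on the evaluation data are themselves noisy; this finite-sample effect is one face of the optimistic bias the summary describes.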