🤖 AI Summary
This work addresses two fundamental questions: "What constitutes a principled evaluation of forecasts?" and "How are existing evaluation notions formally related?" It proposes a unified evaluation framework grounded in game theory, organizing prediction quality around four facets of forecast felicity: calibration, regret, predictiveness, and randomness. Framing evaluation as a game between a forecaster, a gambler, and nature, the work establishes the conceptual equivalence of calibration and regret, and a duality between good forecasts and random outcomes: outcomes that are random with respect to forecasts correspond to forecasts that are good with respect to outcomes. Methodologically, the framework ties together probabilistic calibration, regret analysis, and algorithmic randomness measures, specifically martingale difference sequences. The result is a more rigorous theoretical foundation for forecast evaluation, supporting both the interpretability and the robustness assessment of trustworthy AI predictions.
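To make the two central evaluation notions concrete, here is a minimal, purely illustrative sketch (not the paper's formal definitions): it computes a binned calibration error and the regret of a probabilistic forecaster against the best constant prediction in hindsight, on synthetic binary outcomes. The helper names `binned_calibration_error` and `regret_vs_best_constant`, the squared loss, and the toy data-generating process are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (not from the paper): nature draws binary outcomes with a hidden
# probability, and the forecaster issues slightly noisy probabilistic forecasts.
T = 10_000
true_p = 0.7
outcomes = rng.binomial(1, true_p, size=T)                    # nature
forecasts = np.clip(true_p + rng.normal(0, 0.05, T), 0, 1)    # forecaster

def binned_calibration_error(p, y, n_bins=10):
    """Average |empirical frequency - mean forecast| over forecast bins."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    err, total = 0.0, len(p)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.sum() / total * abs(y[mask].mean() - p[mask].mean())
    return err

def regret_vs_best_constant(p, y):
    """Cumulative squared loss minus the loss of the best constant forecast in hindsight."""
    forecaster_loss = np.sum((p - y) ** 2)
    best_constant = y.mean()                    # minimizer of cumulative squared loss
    best_loss = np.sum((best_constant - y) ** 2)
    return forecaster_loss - best_loss

print("calibration error:", binned_calibration_error(forecasts, outcomes))
print("regret vs. best constant:", regret_vs_best_constant(forecasts, outcomes))
```

A well-calibrated forecaster keeps both quantities small; the paper's point is that these are not two unrelated checks but dual views of the same evaluation game.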
📝 Abstract
Machine learning is about forecasting. Forecasts, however, become useful only through their evaluation. Machine learning has traditionally focused on types of losses and their corresponding regret. Recently, the machine learning community has regained interest in calibration. In this work, we show the conceptual equivalence of calibration and regret in evaluating forecasts. We frame the evaluation problem as a game between a forecaster, a gambler, and nature. Under intuitive restrictions on the gambler and the forecaster, calibration and regret fall naturally out of the framework. In addition, this game links the evaluation of forecasts to the randomness of outcomes: outcomes that are random with respect to forecasts are equivalent to forecasts that are good with respect to outcomes. We call these dual aspects, calibration and regret, predictiveness and randomness, the four facets of forecast felicity.
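The game between forecaster, gambler, and nature can be illustrated with a toy betting experiment: a gambler buys contracts priced at the forecaster's probabilities and accumulates wealth only when the forecasts are systematically off, so bounded gambler wealth certifies that outcomes look random relative to the forecasts. The sketch below is a hedged illustration of this idea using Kelly-style betting on a binary contract; it is not the paper's construction, and every name and parameter here is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def gambler_log_wealth(forecasts, outcomes, believed_p):
    """Log-wealth of a gambler who bets as if the true probability were
    `believed_p`, on contracts priced at the forecaster's probabilities.

    Each round, a unit stake on outcome 1 pays 1/p if the outcome is 1 and 0
    otherwise; the fraction of wealth staked follows the Kelly criterion.
    For simplicity the gambler only bets on outcome 1 when it looks favourable.
    """
    log_wealth = 0.0
    for p, y in zip(forecasts, outcomes):
        p = float(np.clip(p, 1e-6, 1 - 1e-6))
        q = float(np.clip(believed_p, 1e-6, 1 - 1e-6))
        f = (q - p) / (1 - p) if q > p else 0.0   # Kelly fraction, long bets only
        gross = 1 - f + f * (y / p)               # wealth multiplier this round
        log_wealth += np.log(max(gross, 1e-12))
    return log_wealth

T = 5_000
true_p = 0.6
outcomes = rng.binomial(1, true_p, size=T)        # nature

good_forecasts = np.full(T, true_p)               # well-calibrated forecaster
bad_forecasts = np.full(T, 0.4)                   # systematically miscalibrated

# Against good forecasts the outcomes look random to the gambler: no edge, no wealth.
print("log-wealth vs. good forecasts:", gambler_log_wealth(good_forecasts, outcomes, true_p))
# Against miscalibrated forecasts the gambler's log-wealth grows roughly linearly,
# certifying that the forecasts were poor.
print("log-wealth vs. bad forecasts:", gambler_log_wealth(bad_forecasts, outcomes, true_p))
```

In this toy setting, unbounded gambler wealth and poor forecasts coincide, which mirrors the abstract's duality: random outcomes with respect to forecasts are equivalent to good forecasts with respect to outcomes.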