🤖 AI Summary
This work proposes an interpretable statistical inference framework that decomposes predictive scoring functions into three components: miscalibration, discrimination, and uncertainty. The decomposition applies to multi-step-ahead point forecasts such as means and quantiles and accommodates both smooth and non-smooth scoring functions. Building on a linear recalibration of the forecasts and combining Mincer–Zarnowitz regression with asymptotic inference theory, the method delivers the first fully interpretable tripartite decomposition for general scoring functions, unifying and extending classical calibration tests and predictive performance evaluation. Empirical applications to inflation surveys and financial risk models reveal critical discrepancies that aggregate scores obscure, including a misalignment between backtesting practices and predictive accuracy in banking regulation, and thereby substantially enhance the informativeness and statistical power of forecast evaluation.
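To fix ideas, the tripartite structure can be written as a score-decomposition identity in the spirit of the summary above (a sketch with assumed notation, not the paper's own): writing $\bar S(\cdot)$ for the average realized score, $x$ for the original forecasts, $\hat x$ for their linearly recalibrated version, and $x_{mg}$ for the unconditional (marginal) forecast,

```latex
% Sketch of the tripartite identity; notation assumed for illustration.
% \bar S(\cdot) averages the realized score over the evaluation sample.
\bar S(x)
  = \underbrace{\bar S(x) - \bar S(\hat x)}_{\text{MCB (miscalibration)}}
  - \underbrace{\bar S(x_{mg}) - \bar S(\hat x)}_{\text{DSC (discrimination)}}
  + \underbrace{\bar S(x_{mg})}_{\text{UNC (uncertainty)}}
```

The terms telescope, so the identity holds exactly: a well-calibrated, informative forecast has small MCB and large DSC, while UNC depends only on the outcomes.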
📝 Abstract
We introduce inference methods for score decompositions, which partition scoring functions for predictive assessment into three interpretable components: miscalibration, discrimination, and uncertainty. Our estimation and inference rely on a linear recalibration of the forecasts, which is applicable to general multi-step-ahead point forecasts such as means and quantiles due to its validity for both smooth and non-smooth scoring functions. This approach ensures desirable finite-sample properties, enables asymptotic inference, and establishes a direct connection to the classical Mincer–Zarnowitz regression. The resulting inference framework facilitates tests for equal forecast calibration or discrimination, which yield three key advantages: they enhance the information content of predictive ability tests by decomposing scores, deliver higher statistical power in certain scenarios, and formally connect scoring-function-based evaluation to traditional calibration tests such as financial backtests. Applications demonstrate the method's utility. We find that for survey inflation forecasts, discrimination abilities can differ significantly even when overall predictive ability does not. In an application to financial risk models, our tests provide deeper insights into the calibration and information content of volatility and Value-at-Risk forecasts. By disentangling forecast accuracy from backtest performance, the method exposes critical shortcomings in current banking regulation.
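As a concrete illustration of the recalibration step, below is a minimal sketch for the special case of mean forecasts under the squared-error score, where the linear recalibration reduces to an ordinary Mincer–Zarnowitz OLS regression. Function and variable names here are illustrative, and the paper's asymptotic inference (standard errors, equal-calibration and equal-discrimination tests) is not reproduced:

```python
import numpy as np

def mz_score_decomposition(x, y):
    """Decompose the mean squared error of point forecasts x for outcomes y
    into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC)
    via a linear (Mincer-Zarnowitz) recalibration.  Illustrative sketch only.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    s_orig = np.mean((y - x) ** 2)                 # score of original forecasts
    # Linear recalibration: OLS of y on (1, x), i.e. the MZ regression.
    X = np.column_stack([np.ones_like(x), x])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    s_rc = np.mean((y - (a + b * x)) ** 2)         # score after recalibration
    s_mg = np.mean((y - y.mean()) ** 2)            # score of marginal forecast
    mcb = s_orig - s_rc    # what recalibration removes (>= 0 in sample)
    dsc = s_mg - s_rc      # gain over the uninformative marginal forecast
    unc = s_mg             # outcome variability; forecast-independent
    assert np.isclose(s_orig, mcb - dsc + unc)     # identity holds exactly
    return {"MCB": mcb, "DSC": dsc, "UNC": unc}

# Example: informative but miscalibrated forecasts.
rng = np.random.default_rng(0)
y = rng.normal(size=500)
x = 0.5 * y + rng.normal(scale=0.5, size=500)
print(mz_score_decomposition(x, y))
```

For quantile forecasts such as Value-at-Risk, the same recipe would use the pinball (tick) loss and a quantile analogue of the MZ regression, which is one place where validity for non-smooth scoring functions becomes relevant.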