🤖 AI Summary
This paper addresses three challenges in mortgage credit risk prediction: nonlinear effects, interactions among predictor variables, and spatio-temporal heterogeneity that observable predictors do not explain. It proposes a model that combines tree-boosting with a latent spatio-temporal Gaussian process (GP). The boosted trees capture nonlinearities and interactions in a data-driven manner, while the latent GP models frailty correlation, that is, unobserved regional and temporal variation in default risk. The authors also show how estimation and prediction can be carried out in a computationally efficient manner, and they use interpretability tools for machine learning models to explain the results. In an application to a large U.S. mortgage data set, the approach yields more accurate default probabilities for individual loans and more accurate loan portfolio loss distributions than conventional independent linear hazard models and linear spatio-temporal models.
📝 Abstract
We introduce a novel machine learning model for credit risk that combines tree-boosting with a latent spatio-temporal Gaussian process model accounting for frailty correlation. This allows for modeling non-linearities and interactions among predictor variables in a flexible, data-driven manner and for accounting for spatio-temporal variation that is not explained by observable predictor variables. We also show how estimation and prediction can be done in a computationally efficient manner. In an application to a large U.S. mortgage credit risk data set, we find that both the predictive default probabilities for individual loans and the predictive loan portfolio loss distributions obtained with our approach are more accurate than those of conventional independent linear hazard models and of linear spatio-temporal models. Using interpretability tools for machine learning models, we find that the likely reasons for this outperformance are strong interaction and non-linear effects in the predictor variables and the presence of large spatio-temporal frailty effects.
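To illustrate why frailty correlation matters for portfolio loss distributions, the following is a minimal sketch in standard-library Python, not the paper's model. It makes two loud simplifications: the boosted-tree ensemble is replaced by a hand-written step function (`tree_like_score`, hypothetical, with invented thresholds on loan-to-value and credit score), and the latent spatio-temporal Gaussian process is replaced by one independent Gaussian random effect per region. All names, parameter values, and the simulated loans are assumptions made for illustration only.

```python
import math
import random
import statistics


def sigmoid(z):
    """Map a real-valued score to a default probability."""
    return 1.0 / (1.0 + math.exp(-z))


def tree_like_score(ltv, fico):
    # Hypothetical stand-in for a boosted-tree ensemble: piecewise constant
    # in the predictors, with an interaction between LTV and credit score.
    score = -4.0
    if ltv > 0.8:
        score += 0.8
        if fico < 660:  # interaction: high LTV hurts more for weak credit
            score += 1.0
    if fico < 620:
        score += 0.7
    return score


def make_loans(n_loans, n_regions, rng):
    # Simulated loans (assumed data): (loan-to-value, credit score, region).
    return [
        (rng.uniform(0.5, 1.0), rng.randint(550, 800), rng.randrange(n_regions))
        for _ in range(n_loans)
    ]


def simulate_default_counts(loans, n_regions, frailty_sd, n_sims, rng):
    # Each scenario draws one shared latent effect per region (a crude
    # substitute for a latent GP) and then simulates every loan's default.
    scores = [(tree_like_score(ltv, fico), region) for ltv, fico, region in loans]
    counts = []
    for _ in range(n_sims):
        frailty = [rng.gauss(0.0, frailty_sd) for _ in range(n_regions)]
        defaults = sum(
            1
            for score, region in scores
            if rng.random() < sigmoid(score + frailty[region])
        )
        counts.append(defaults)
    return counts


rng = random.Random(0)
loans = make_loans(n_loans=300, n_regions=3, rng=rng)
independent = simulate_default_counts(loans, 3, frailty_sd=0.0, n_sims=500, rng=rng)
correlated = simulate_default_counts(loans, 3, frailty_sd=1.0, n_sims=500, rng=rng)
print("variance of default count without frailty:", statistics.variance(independent))
print("variance of default count with frailty:   ", statistics.variance(correlated))
```

In this toy setting, the shared regional effect makes defaults within a region move together, so the simulated number of defaults in the portfolio is far more dispersed than under independence. A model that ignores such correlation would tend to understate the tails of the loss distribution, which is the kind of effect the abstract attributes to frailty.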