🤖 AI Summary
This study addresses the lack of reliable validation mechanisms for Bayes factor computation. To this end, we propose two calibration methods: (1) an improved variant of simulation-based calibration checking (SBC), and (2) a calibration check based on metrics for binary predictions. We comparatively evaluate these against established approaches, including data-averaged posterior checks and the Good check, and find that binary prediction calibration achieves higher sensitivity under limited computational budgets, whereas SBC with well-designed test quantities can detect a broader spectrum of computational errors; the Good check as originally described fails to control its error rates. A separate approach based on posterior SBC is required to validate Bayes factor computation under improper priors. Empirical experiments show that the mainstream R packages *bridgesampling* and *BayesFactor* pass all available checks and are likely safe to use in standard scenarios. We recommend that new implementations be validated with at least several hundred simulations. Overall, this work establishes a more efficient and robust framework for validating Bayesian inference, particularly Bayes factor computation.
📝 Abstract
We propose and evaluate two methods that validate the computation of Bayes factors: one based on an improved variant of simulation-based calibration checking (SBC) and one based on calibration metrics for binary predictions. We show that, in theory, binary prediction calibration is equivalent to a special case of SBC, but that with finite resources binary prediction calibration is typically more sensitive. However, with well-designed test quantities, SBC can detect all possible problems in computation, including some that cannot be uncovered by binary prediction calibration.
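To make the idea of binary prediction calibration concrete, the following is a minimal sketch on a hypothetical conjugate toy problem of our own construction (not an example from the paper): comparing M0 (a coin with theta = 0.5) against M1 (theta ~ Uniform(0, 1)) with equal prior model probabilities, where the marginal likelihoods, and hence the posterior model probability, are available in closed form. If the Bayes factor computation is correct, the posterior probability of M1 is a calibrated binary prediction of the true model indicator, which we check here with a simple binned expected calibration error. All function names and settings are illustrative.

```python
import math
import random

def posterior_prob_m1(y, n):
    """Exact P(M1 | y) for M0: theta = 0.5 vs M1: theta ~ Uniform(0, 1),
    with equal prior model probabilities."""
    # Marginal likelihood under M0: C(n, y) * 0.5^n.
    p0 = math.comb(n, y) * 0.5 ** n
    # Marginal likelihood under M1: integral of the binomial likelihood
    # over theta ~ Uniform(0, 1), which equals 1 / (n + 1).
    p1 = 1.0 / (n + 1)
    return p1 / (p0 + p1)

def binary_calibration_check(n_sims=20000, n_obs=20, n_bins=10, seed=1):
    """Simulate (model, data) pairs from the joint prior and check that
    the computed posterior model probabilities are calibrated against
    the true model indicator, returning a binned calibration error."""
    rng = random.Random(seed)
    probs, indicators = [], []
    for _ in range(n_sims):
        m1 = rng.random() < 0.5                # model indicator from its prior
        theta = rng.random() if m1 else 0.5    # parameter from the chosen model
        y = sum(rng.random() < theta for _ in range(n_obs))  # Binomial draw
        probs.append(posterior_prob_m1(y, n_obs))
        indicators.append(1.0 if m1 else 0.0)
    # Within each probability bin, the empirical frequency of M1 should
    # match the mean predicted probability; accumulate a weighted error.
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, p in enumerate(probs)
               if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if len(idx) < 100:  # skip nearly empty bins
            continue
        mean_p = sum(probs[i] for i in idx) / len(idx)
        freq = sum(indicators[i] for i in idx) / len(idx)
        ece += (len(idx) / n_sims) * abs(freq - mean_p)
    return ece
```

A buggy Bayes factor implementation (for instance, one that systematically overstates evidence for M1) would produce probabilities that drift away from the empirical frequencies, inflating the calibration error; a correct implementation keeps it near zero up to Monte Carlo noise.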
Previous work on Bayes factor validation includes checks based on the data-averaged posterior and the Good check method. We demonstrate that both checks miss many problems in Bayes factor computation detectable with SBC and binary prediction calibration. Moreover, we find that the Good check as originally described fails to control its error rates. Our proposed checks also typically use simulation results more efficiently than data-averaged posterior checks. Finally, we show that a special approach based on posterior SBC is necessary when checking Bayes factor computation under improper priors and we validate several models with such priors.
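For contrast, a data-averaged posterior check can be sketched on the same hypothetical coin-comparison toy problem (again our own illustration, not the paper's setup): by the law of total probability, the posterior model probability averaged over datasets simulated from the joint prior must equal the prior model probability (0.5 here) when the computation is correct. This check only constrains the average, which is one reason it can miss problems that calibration-based checks detect.

```python
import math
import random

def posterior_prob_m1(y, n):
    """Exact P(M1 | y) for M0: theta = 0.5 vs M1: theta ~ Uniform(0, 1),
    with equal prior model probabilities."""
    p0 = math.comb(n, y) * 0.5 ** n  # marginal likelihood under M0
    p1 = 1.0 / (n + 1)               # marginal likelihood under M1
    return p1 / (p0 + p1)

def data_averaged_posterior(n_sims=20000, n_obs=20, seed=1):
    """Average the posterior model probability over datasets drawn from
    the joint prior; the result should match the prior probability 0.5
    if the Bayes factor computation is correct."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        m1 = rng.random() < 0.5
        theta = rng.random() if m1 else 0.5
        y = sum(rng.random() < theta for _ in range(n_obs))
        total += posterior_prob_m1(y, n_obs)
    return total / n_sims
```

Note that a computation whose errors average out over datasets (say, symmetric over- and under-confidence) would still pass this check while failing a calibration check, illustrating the lower sensitivity noted above.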
We recommend that novel methods for Bayes factor computation be validated with SBC and binary prediction calibration with at least several hundred simulations. For all the models we tested, the bridgesampling and BayesFactor R packages satisfy all available checks and thus are likely safe to use in standard scenarios.