🤖 AI Summary
Neural posterior estimation (NPE) validation suffers from reliance on strong classifiers and lacks finite-sample theoretical guarantees. Method: We propose Conformal C2ST, a conformalized two-sample test framework for classifier-based posterior diagnostics. Building on Hu & Lei's conformal inference theory, it calibrates arbitrary classifier outputs into exact finite-sample p-values without requiring classifier optimality. Contribution/Results: We establish the first theoretical guarantee that Conformal C2ST achieves high statistical power and strict Type-I error control, even with weak or overfitted classifiers. Its power degradation is provably stable and robust to model misspecification. Empirically, Conformal C2ST significantly outperforms standard C2ST and other discriminative tests across multiple benchmark tasks. It is the first posterior diagnostic tool for simulation-based inference that simultaneously ensures finite-sample validity and computational practicality.
📝 Abstract
Neural Posterior Estimation (NPE) has emerged as a powerful approach for amortized Bayesian inference when the true posterior $p(\theta \mid y)$ is intractable or difficult to sample. But evaluating the accuracy of neural posterior estimates remains challenging, with existing methods suffering from major limitations. One appealing and widely used method is the classifier two-sample test (C2ST), where a classifier is trained to distinguish samples from the true posterior $p(\theta \mid y)$ versus the learned NPE approximation $q(\theta \mid y)$. Yet despite the appealing simplicity of the C2ST, its theoretical and practical reliability depends upon having access to a near-Bayes-optimal classifier -- a requirement that is rarely met and, at best, difficult to verify. Thus a major open question is: can a weak classifier still be useful for neural posterior validation? We show that the answer is yes. Building on the work of Hu and Lei, we present several key results for a conformal variant of the C2ST, which converts any trained classifier's scores -- even those of weak or overfitted models -- into exact finite-sample p-values. We establish two key theoretical properties of the conformal C2ST: (i) finite-sample Type-I error control, and (ii) non-trivial power that degrades gently in tandem with the error of the trained classifier. The upshot is that even weak, biased, or overfit classifiers can still yield powerful and reliable tests. Empirically, the Conformal C2ST outperforms classical discriminative tests across a wide range of benchmarks. These results reveal the underappreciated strength of weak classifiers for validating neural posterior estimates, establishing the conformal C2ST as a practical, theoretically grounded diagnostic for modern simulation-based inference.
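The conformal p-value at the heart of this construction can be sketched in a few lines: rank a held-out test point's classifier score against scores from a calibration set drawn from the true posterior. The function name, the Gaussian toy scores, and the seed below are illustrative assumptions, not the paper's actual implementation; the validity of the rank-based p-value holds for any score function, however weak the underlying classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_p_value(calib_scores, test_score):
    # Finite-sample-valid p-value: (1 + #{calibration scores >= test score}) / (n + 1).
    # Super-uniform under the null whenever the test score is exchangeable
    # with the calibration scores, regardless of classifier quality.
    n = len(calib_scores)
    return (1.0 + np.sum(calib_scores >= test_score)) / (n + 1.0)

# Toy "classifier scores": under H0 (q matches p), a test point's score is
# exchangeable with the calibration scores; under H1 it tends to be larger.
calib = rng.normal(size=500)         # scores of held-out true-posterior draws
null_test = rng.normal()             # score exchangeable with calib (H0)
shifted_test = rng.normal(loc=3.0)   # score from a mismatched posterior (H1)

p_null = conformal_p_value(calib, null_test)
p_shift = conformal_p_value(calib, shifted_test)
```

The key design point, per Hu and Lei's framework, is that Type-I error control comes from the exchangeability of ranks, not from the classifier being near-Bayes-optimal; a poor classifier only costs power, never validity.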