🤖 AI Summary
This work addresses prediction churn, the undesirable variability in model predictions across different training data samples, which critically undermines reliability in scientific machine learning. The study presents the first systematic formulation and quantification of this issue and introduces Twin-Bootstrap, a novel co-training strategy that constructs two independent bootstrap subsets via guided sampling and enforces output consistency between dual networks through a symmetric KL divergence loss. At 2× the compute of standard (ERM) training, Twin-Bootstrap substantially mitigates prediction churn while preserving predictive accuracy. Evaluated on nine chemistry benchmarks, it reduces median prediction churn by a further 45% beyond standard Bagging (K=2) at matched compute. Bagging itself reduces churn by 40–54% on every dataset, whereas conventional parameter-side methods (deep ensembles, MC dropout, stochastic weight averaging) do not reduce it.
📝 Abstract
Scientific machine learning reports predictive performance. It does not report whether the same prediction would survive a different draw of training data. Across $9$ chemistry benchmarks, two classifiers trained on independent bootstraps of the same training set agree on aggregate accuracy to within $1.3\text{--}4.2$ percentage points but disagree on the class label of $8.0\text{--}21.8\%$ of test molecules. We call this gap \emph{cross-sample prediction churn}. The standard parameter-side techniques (deep ensembles, MC dropout, stochastic weight averaging) do not reduce this gap; two data-side methods do. The first is $K$-bootstrap bagging, which cuts the churn rate by $40\text{--}54\%$ on every dataset at no accuracy cost (at $K{\times}$-ERM compute). The second is \emph{twin-bootstrap}, our proposal: two networks trained jointly on independent bootstraps with a symmetric-KL (sym-KL) consistency loss between their predictions, which at matched $2{\times}$-ERM compute reduces churn by a further median of $45\%$ beyond bagging-$K{=}2$. Cross-sample prediction churn deserves a column alongside predictive performance in scientific-ML benchmark reports, because without it the parameter-side and data-side methods are indistinguishable on the metric they actually differ on.
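The two quantities the abstract relies on can be illustrated with a minimal sketch. It assumes that churn is measured as the fraction of test items on which two models predict different class labels, and that the sym-KL term is the average of the two directed KL divergences between the networks' predicted class distributions. The abstract does not spell out either formula, so the function names, the averaging factor, and the toy predictions below are illustrative assumptions, not the authors' implementation.

```python
import math


def accuracy(predicted, truth):
    """Fraction of items whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def churn_rate(labels_a, labels_b):
    """Fraction of test items on which two models predict different labels.

    Assumed definition of cross-sample prediction churn (hypothetical helper).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty prediction lists of equal length")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def kl_divergence(p, q, eps=1e-12):
    """Directed KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q, eps=1e-12):
    """Average of KL(p || q) and KL(q || p).

    The 0.5 factor is an assumption; the source only names a sym-KL loss.
    """
    return 0.5 * (kl_divergence(p, q, eps) + kl_divergence(q, p, eps))


# Toy example: both models are equally accurate but err on different items.
truth   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
preds_a = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]  # wrong on items 0 and 1
preds_b = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]  # wrong on items 2 and 3

print("accuracy A:", accuracy(preds_a, truth))
print("accuracy B:", accuracy(preds_b, truth))
print("churn A vs B:", churn_rate(preds_a, preds_b))
print("sym-KL:", symmetric_kl([0.9, 0.1], [0.6, 0.4]))
```

In this toy case both models reach the same accuracy (8 of 10 correct) yet disagree on 4 of 10 items, which is the kind of gap that an accuracy column alone cannot show. In a joint training setup, a term like `symmetric_kl` would be added to the two networks' supervised losses with some weight; that weight is not given in the abstract.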