🤖 AI Summary
This study addresses the large generalization gap in PROTAC activity prediction between random-split and leave-one-target-out (LOTO) evaluation, a gap that has been documented but not previously decomposed. Using a variance decomposition framework, the authors identify inter-laboratory measurement variance as the dominant contributor, with its contribution bounded at 0.124 AUROC, well above the 0.05 attributable to the choice of binarization threshold. Across eight published architectures and ESM-2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, including under SMILES-level deduplication, and a 2000-trial hyperparameter optimization cannot break this ceiling; the best single-seed configuration regresses by 0.161 AUROC under multi-seed validation. Few-shot (k=5) stratified per-target retraining combined with ADMET features improves 65-target LOTO AUROC from 0.668 to 0.705, and post-hoc Platt scaling brings the model outputs to within the 0.05 calibration threshold. The authors release PROTAC-Bench, a standardized benchmark, together with the variance-decomposition framework, the per-target calibration protocol, and the evaluation code.
📝 Abstract
Machine-learning predictors of biochemical activity often exhibit large generalisation gaps between random-split and leave-one-target-out evaluation; these gaps have been documented but not decomposed. We frame this as an evaluation-science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis-targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials. Published predictors report AUROC of 0.85 to 0.91 under random-split cross-validation, while the leave-one-target-out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within-target interpolation, whereas LOTO measures the novel-target prediction that de-novo design depends on. We decompose this gap and identify inter-laboratory measurement variance as the dominant component, anchored by a within-target cross-laboratory cascade that bounds the inter-laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation-threshold choice. Across eight published architectures and ESM-2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES-level deduplication. A 21-dimensional, 2000-trial hyperparameter optimisation cannot break this ceiling, and the rank-1 single-seed configuration regresses by 0.161 AUROC under multi-seed validation, matching a closed-form selection-bias prediction (Bailey and Lopez de Prado, 2014). Few-shot (k=5) stratified per-target retraining combined with ADMET features lifts 65-target LOTO AUROC from 0.668 to 0.705, and post-hoc Platt scaling brings the raw model outputs to within the 0.05 threshold for well-calibrated predictions. We release PROTAC-Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance-decomposition framework, the per-target calibration protocol, and the evaluation code.
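The difference between random splits and LOTO comes down to how folds are built: under LOTO, every measurement for one target is held out together, so the model is never scored on a target it saw during training. The following is a minimal sketch of that fold construction with a pairwise AUROC helper, not the PROTAC-Bench evaluation code. The function names `loto_folds` and `auroc`, the target labels, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def loto_folds(targets):
    """Yield (held_out_target, train_idx, test_idx), one fold per distinct target."""
    by_target = defaultdict(list)
    for i, t in enumerate(targets):
        by_target[t].append(i)
    for held_out in sorted(by_target):
        test_idx = by_target[held_out]
        # Training indices come from every other target, never the held-out one.
        train_idx = sorted(
            i for t, idx in by_target.items() if t != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Returns None when only one class is present,
    because AUROC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy records (hypothetical): (target, binary activity label, model score).
records = [
    ("target_A", 1, 0.9), ("target_A", 0, 0.2), ("target_A", 1, 0.7),
    ("target_B", 1, 0.4), ("target_B", 0, 0.6),
    ("target_C", 0, 0.1), ("target_C", 1, 0.8),
]
targets = [r[0] for r in records]
labels = [r[1] for r in records]
scores = [r[2] for r in records]

folds = list(loto_folds(targets))
per_target_auroc = {
    held_out: auroc([labels[i] for i in test_idx], [scores[i] for i in test_idx])
    for held_out, _, test_idx in folds
}
```

In a real evaluation the model would be refit on `train_idx` in each fold before scoring `test_idx`; the fixed scores here only keep the sketch self-contained. A random split would instead shuffle all indices regardless of target, letting measurements from the same target appear on both sides of the split.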