Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling

📅 2026-05-12
🤖 AI Summary
This study examines the large generalization gap in PROTAC activity prediction between random-split and leave-one-target-out (LOTO) evaluation, a gap that has been documented but not previously decomposed. Using a variance decomposition framework, the authors identify inter-laboratory measurement variance as the dominant component, bounded at 0.124 AUROC by a within-target cross-laboratory analysis and well above the 0.05 attributable to the choice of binarization threshold. LOTO AUROC plateaus near 0.67 across eight published architectures and ESM-2 protein language models of up to 3B parameters, including under SMILES-level deduplication. A 2000-trial hyperparameter optimization does not break this ceiling, and its best single-seed configuration regresses by 0.161 AUROC under multi-seed validation. Few-shot (k=5) per-target retraining combined with ADMET features raises 65-target LOTO AUROC from 0.668 to 0.705, and post-hoc Platt scaling brings the raw outputs to within the 0.05 well-calibrated threshold. The authors release PROTAC-Bench, a standardized benchmark, together with the decomposition framework, a per-target calibration protocol, and the evaluation code.
📝 Abstract
Machine-learning predictors of biochemical activity often exhibit large random-split-to-leave-one-target-out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation-science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis-targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random-split cross-validation, while the leave-one-target-out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within-target interpolation, whereas LOTO measures the novel-target prediction that de-novo design depends on. We decompose this gap and identify inter-laboratory measurement variance as the dominant component, anchored by a within-target cross-laboratory cascade bounding the inter-laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation-threshold choice. Across eight published architectures and ESM-2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES-level deduplication; a 21-dimensional 2000-trial hyperparameter optimisation cannot break this ceiling, and the rank-1 single-seed configuration regresses by 0.161 AUROC under multi-seed validation, matching a closed-form selection-bias prediction (Bailey and Lopez de Prado, 2014). Few-shot k=5 stratified per-target retraining combined with ADMET features lifts 65-target LOTO AUROC from 0.668 to 0.7050, and post-hoc Platt scaling recovers raw output to within the 0.05 well-calibrated threshold. We release PROTAC-Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance-decomposition framework, the per-target calibration protocol, and the evaluation code.
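The difference between the two protocols can be made concrete with a minimal sketch, which is not the authors' released evaluation code. It builds one fold per target so that no record from the held-out target appears in training, and scores each fold with a rank-based AUROC. The names `loto_folds`, `evaluate_loto`, `min_per_target`, and the toy nearest-centroid scorer are all hypothetical. The abstract reports 65 folds from 173 targets, so some eligibility filter presumably applies; `min_per_target` is only an assumed stand-in for it.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Returns None when one class is absent,
    because AUROC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loto_folds(targets, min_per_target=1):
    """Yield (held_out_target, train_idx, test_idx), one fold per target.

    Targets with fewer than `min_per_target` records never become a test
    fold, but their records still appear in other folds' training sets.
    """
    by_target = defaultdict(list)
    for i, t in enumerate(targets):
        by_target[t].append(i)
    for t, test_idx in sorted(by_target.items()):
        if len(test_idx) < min_per_target:
            continue
        held_out = set(test_idx)
        train_idx = [i for i in range(len(targets)) if i not in held_out]
        yield t, train_idx, test_idx


def evaluate_loto(targets, labels, fit_and_score, min_per_target=1):
    """Per-target AUROC; `fit_and_score(train_idx, test_idx)` returns scores."""
    results = {}
    for t, train_idx, test_idx in loto_folds(targets, min_per_target):
        scores = fit_and_score(train_idx, test_idx)
        results[t] = auroc([labels[i] for i in test_idx], scores)
    return results


# Toy data (invented for illustration): one feature per record.
targets = ["BRD4", "BRD4", "BRD4", "BTK", "BTK", "AR"]
x = [0.9, 0.8, 0.2, 0.7, 0.3, 0.5]
y = [1, 1, 0, 1, 0, 1]


def centroid_scorer(train_idx, test_idx):
    """Toy model: score by closeness to the mean feature of training actives."""
    actives = [x[i] for i in train_idx if y[i] == 1]
    centre = sum(actives) / len(actives)
    return [-abs(x[i] - centre) for i in test_idx]


results = evaluate_loto(targets, y, centroid_scorer, min_per_target=2)
print(results)
```

A random split would instead shuffle record indices without regard to `targets`, so records from the same target could land on both sides of the split; that is the within-target interpolation the abstract says random splits reward.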
Problem

Research questions and friction points this paper is trying to address.

generalization gap
PROTAC
leave-one-target-out
inter-laboratory variance
activity prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

generalization gap decomposition
inter-laboratory variance
leave-one-target-out prediction
PROTAC activity prediction
few-shot retraining
Authors

Thor Klamt
L3S Research Center, Leibniz Universität Hannover, Appelstraße 9a, 30167 Hannover, Germany
Wolfgang Nejdl
Professor of Computer Science, Leibniz Universität Hannover, L3S Research Center, Hannover, Germany
Information Retrieval, Web Science, Social Media, Data Mining, Semantic Technologies
Ming Tang
L3S Research Center, Leibniz Universität Hannover, Appelstraße 9a, 30167 Hannover, Germany; Institute of Data Science (Knowledge-Based Systems), Faculty of Electrical Engineering and Computer Science, Leibniz Universität Hannover, Appelstraße 9a, 30167 Hannover, Germany