🤖 AI Summary
Experimental protein–ligand complex structures are scarce, hindering data-driven drug binding affinity prediction.
Method: We propose an AI-based data augmentation approach: (i) generate synthetic complexes at scale with protein–ligand co-folding models (e.g., RoseTTAFold All-Atom), and (ii) automatically select high-quality predictions with lightweight heuristic rules based on per-residue pLDDT, interface residue confidence, and geometric plausibility, so that the selected predictions can substitute for experimental structures when training machine learning scoring functions.
Contribution/Results: This work is the first to systematically demonstrate that rigorously filtered AI-predicted structures can support high-accuracy affinity modeling. On standard benchmarks (e.g., PDBbind), models trained solely on filtered synthetic data perform on par with, or even better than, baselines trained on experimental structures (ΔRMSE ≤ 0.2 kcal/mol), markedly reducing reliance on experimentally determined complexes.
📝 Abstract
We evaluate the feasibility of using co-folding models for synthetic data augmentation in training machine learning-based scoring functions (MLSFs) for binding affinity prediction. Our results show that performance gains depend critically on the structural quality of the augmented data. In light of this, we establish simple heuristics for identifying high-quality co-folding predictions without reference structures, enabling them to substitute for experimental structures in MLSF training. Our study informs future data augmentation strategies based on co-folding models.
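The reference-free filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `CoFoldPrediction` container, the `passes_filter` function, and every threshold below are hypothetical and chosen only for illustration. The sketch assumes each prediction already carries per-residue pLDDT values, the indices of residues at the ligand interface, and the closest protein–ligand heavy-atom distance.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class CoFoldPrediction:
    """Hypothetical container for one predicted protein-ligand complex."""
    name: str
    plddt: Sequence[float]              # per-residue pLDDT on a 0-100 scale
    interface_residues: Sequence[int]   # indices of residues near the ligand
    min_protein_ligand_dist: float      # closest heavy-atom distance (angstroms)


def passes_filter(
    pred: CoFoldPrediction,
    min_mean_plddt: float = 70.0,       # illustrative threshold, not from the paper
    min_interface_plddt: float = 80.0,  # illustrative threshold, not from the paper
    clash_dist: float = 2.0,            # below this, treat as a steric clash
    max_contact_dist: float = 4.5,      # above this, ligand is not in contact
) -> bool:
    """Return True if the prediction passes all three heuristic checks."""
    if not pred.plddt or not pred.interface_residues:
        return False
    # Check 1: overall model confidence.
    global_ok = mean(pred.plddt) >= min_mean_plddt
    # Check 2: confidence restricted to residues at the binding interface.
    interface_ok = (
        mean(pred.plddt[i] for i in pred.interface_residues) >= min_interface_plddt
    )
    # Check 3: crude geometric plausibility (no clash, ligand actually bound).
    geometry_ok = clash_dist <= pred.min_protein_ligand_dist <= max_contact_dist
    return global_ok and interface_ok and geometry_ok


predictions = [
    CoFoldPrediction("good", [90, 85, 88, 92], [1, 2], 3.2),
    CoFoldPrediction("low_confidence", [60, 55, 75, 65], [0, 1], 3.5),
    CoFoldPrediction("clash", [90, 90, 90, 90], [0, 1], 1.1),
]
kept = [p.name for p in predictions if passes_filter(p)]
print(kept)
```

None of the checks needs an experimental reference structure, which is the property the abstract emphasizes. In practice the thresholds would have to be calibrated against a set of predictions whose quality is known.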