🤖 AI Summary
The scarcity of public DNA-encoded library (DEL) data hinders machine-learning-driven drug discovery. To address this, we introduce KinDEL, one of the first large, publicly available DEL datasets. It targets two therapeutically relevant kinases (MAPK14 and DDR1) and pairs DEL screening data with on- and off-DNA biophysical validation data, filling a critical gap. Methodologically, we combine high-throughput DEL screening, deep sequencing, molecular graph neural networks, and SE(3)-equivariant structure-aware probabilistic modeling, and we validate the resulting models experimentally. Models trained on KinDEL achieve AUC > 0.89; crucially, on-DNA predictions correlate strongly with off-DNA binding affinities (Pearson *r* = 0.72), supporting both dataset quality and model generalizability. KinDEL and its associated computational framework move DEL modeling from empirical heuristics toward structure- and mechanism-informed prediction, enabling more reliable hit identification and target engagement assessment.
📝 Abstract
DNA-Encoded Libraries (DELs) are combinatorial small-molecule libraries that offer an efficient way to characterize diverse chemical spaces. Selection experiments using DELs are pivotal to drug discovery efforts, enabling high-throughput screens for hit finding. However, the limited availability of public DEL datasets hinders the advancement of computational techniques designed to process such data. To bridge this gap, we present KinDEL, one of the first large, publicly available DEL datasets on two kinases: Mitogen-Activated Protein Kinase 14 (MAPK14) and Discoidin Domain Receptor Tyrosine Kinase 1 (DDR1). Interest in this data modality is growing due to its ability to generate extensive supervised chemical data that densely samples around select molecular structures. Demonstrating one such application of the data, we benchmark different machine learning techniques to develop predictive models for hit identification; in particular, we highlight recent structure-based probabilistic approaches. Finally, we provide biophysical assay data, both on- and off-DNA, to validate our models on a smaller subset of molecules. Data and code for our benchmarks can be found at: https://github.com/insitro/kindel.