🤖 AI Summary
Virtual screening faces three key challenges: severe class imbalance (active compounds are scarce), structural imbalance (a few scaffolds dominate the known actives), and insufficient structural diversity among candidate molecules. To address these, we propose ScaffAug, a scaffold-aware framework that combines generative augmentation, self-training, and diversity-aware re-ranking. It uses a graph diffusion model conditioned on the scaffolds of known hits to generate synthetic molecules; a scaffold-aware sampling algorithm that produces more samples for actives with underrepresented scaffolds; a model-agnostic self-training module that integrates the synthetic data with the original labeled data; and a re-ranking module that increases scaffold diversity in the top-ranked hit list. Evaluated across five target classes, ScaffAug consistently outperforms baseline methods, with improvements in active compound recall (+12.7%) and scaffold coverage (+34.5%). Ablation studies validate the contribution of each component.
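One plausible reading of scaffold-aware sampling is that known actives are chosen as generation seeds with probability inversely related to how common their scaffold is. The sketch below illustrates that idea only and is not ScaffAug's actual algorithm. The function names, the `power` exponent, and the assumption that scaffold identifiers are precomputed strings (for example Murcko scaffold SMILES) are all hypothetical.

```python
import random
from collections import Counter


def scaffold_sampling_weights(scaffolds, power=1.0):
    """Per-molecule sampling probabilities, inverse to scaffold frequency.

    scaffolds: one scaffold identifier per active molecule (assumed precomputed).
    power: hypothetical knob; 0.0 gives uniform sampling, 1.0 gives every
           scaffold the same total probability mass.
    """
    counts = Counter(scaffolds)
    raw = [1.0 / counts[s] ** power for s in scaffolds]
    total = sum(raw)
    return [w / total for w in raw]


def pick_generation_seeds(actives, scaffolds, n_samples, power=1.0, seed=0):
    """Draw actives (with replacement) to condition the generator on."""
    weights = scaffold_sampling_weights(scaffolds, power)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.choices(actives, weights=weights, k=n_samples)


# Three actives share scaffold "A"; one active has the rare scaffold "B".
weights = scaffold_sampling_weights(["A", "A", "A", "B"])
seeds = pick_generation_seeds(["a1", "a2", "a3", "b1"], ["A", "A", "A", "B"], n_samples=20)
```

With `power=1.0`, the single active on scaffold "B" receives probability 0.5 and each of the three "A" actives receives 1/6. The two scaffolds are therefore equally likely to seed a generated molecule even though "A" has three times as many actives.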
📝 Abstract
Ligand-based virtual screening (VS) is an essential step in drug discovery that evaluates large chemical libraries to identify compounds likely to bind a therapeutic target. However, VS faces three major challenges: class imbalance due to the low rate of active compounds, structural imbalance among active molecules, where certain scaffolds dominate, and the need to identify structurally diverse active compounds for novel drug development. We introduce ScaffAug, a scaffold-aware VS framework that addresses these challenges through three modules. First, the augmentation module generates synthetic data conditioned on the scaffolds of actual hits using a graph diffusion model. This mitigates the class imbalance, and it also mitigates the structural imbalance because our proposed scaffold-aware sampling algorithm produces more samples for active molecules with underrepresented scaffolds. Second, a model-agnostic self-training module safely integrates the synthetic data generated by the augmentation module with the original labeled data. Third, a reranking module enhances scaffold diversity in the top recommended set of molecules while maintaining, and even improving, overall performance in identifying novel active compounds. We conduct comprehensive computational experiments across five target classes, comparing ScaffAug against existing baseline methods on multiple evaluation metrics and performing ablation studies. Overall, this work offers new perspectives on enhancing VS through generative augmentation, reranking, and scaffold-awareness.
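To make the reranking idea concrete, the sketch below greedily builds a top-k list and discounts a candidate's predicted score each time its scaffold has already been selected. This is an illustrative stand-in, not the reranking module described in the paper. The `(mol_id, score, scaffold)` tuple layout, the `penalty` parameter, and the linear discount are all assumptions.

```python
from collections import Counter


def rerank_by_scaffold(candidates, k, penalty=0.1):
    """Greedy diversity-aware top-k selection.

    candidates: list of (mol_id, score, scaffold) tuples (hypothetical layout).
    penalty: score discount applied per already-selected molecule that
             shares the candidate's scaffold (assumed linear).
    """
    remaining = list(candidates)
    seen = Counter()  # how many picks so far per scaffold
    picked = []
    while remaining and len(picked) < k:
        # Highest adjusted score wins; ties go to the earlier candidate.
        best = max(remaining, key=lambda c: c[1] - penalty * seen[c[2]])
        picked.append(best)
        seen[best[2]] += 1
        remaining.remove(best)
    return picked


candidates = [
    ("m1", 0.95, "A"),
    ("m2", 0.93, "A"),
    ("m3", 0.90, "A"),
    ("m4", 0.88, "B"),
    ("m5", 0.80, "C"),
]
top3 = rerank_by_scaffold(candidates, k=3, penalty=0.1)
top3_ids = [mol_id for mol_id, _, _ in top3]
```

Ranking by raw score alone would return m1, m2, m3, all on scaffold "A". With `penalty=0.1` the selection becomes m1, m4, m2, which covers two scaffolds while keeping the highest-scoring molecule first. A larger penalty trades more predicted score for more scaffold coverage.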