🤖 AI Summary
Efficiently retrieving structurally diverse yet biologically similar (i.e., potency-similar) molecules from ultra-large-scale chemical libraries remains a critical challenge in reverse drug discovery.
Method: We propose a target-agnostic, potency-driven small-molecule search engine. The approach introduces a potency-oriented molecular representation that decouples similarity assessment from target-specific information. It builds on potency embeddings from a pretrained large model and accelerates similarity search with processor-level SIMD instructions. We further design a target-free contrastive learning framework to improve generalization across diverse bioactivity contexts.
Results: Evaluated on the 40-billion-molecule (about 4 × 10¹⁰) Enamine REAL library, our method achieves millisecond-scale latency with 100% recall and is benchmarked against state-of-the-art baselines on both speed and retrieval quality. To our knowledge, this is the first work enabling real-time, high-fidelity potency-similarity retrieval over a molecular library of this size, establishing a scalable, AI-powered paradigm for target-agnostic reverse drug discovery.
📝 Abstract
Recent successes in virtual screening have been made possible by large models and extensive chemical libraries. However, combining the two is challenging: the larger the model, the more expensive it is to run, which makes screening ultra-large libraries infeasible. To address this, we developed a target-agnostic, efficacy-based molecule search model that finds structurally dissimilar molecules with similar biological activities. Following best practices, we designed a fast retrieval system based on processor-optimized SIMD instructions, which enables us to screen the ultra-large 40B Enamine REAL library with a 100% recall rate. We extensively benchmarked our model against several state-of-the-art models on both speed and retrieval quality for novel molecules.
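As a rough illustration of how a retrieval system can report a 100% recall rate, the sketch below scans every library embedding and keeps the exact top-k matches, so no candidate is skipped the way it can be in approximate indexes. It is not the authors' implementation: the abstract does not say whether the search is exhaustive, and the cosine metric, the 16-dimensional random embeddings, and the helper names `cosine`, `exhaustive_top_k`, and `recall_at_k` are all assumptions made for the example. The per-molecule dot product inside `cosine` is the kind of inner loop that a SIMD kernel would vectorize in a production system.

```python
import heapq
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exhaustive_top_k(query, library, k):
    """Score every library embedding and keep the k best in a bounded min-heap."""
    heap = []  # holds (score, index); heap[0] is the weakest kept hit
    for idx, emb in enumerate(library):
        score = cosine(query, emb)
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, idx))
    # Return hits ordered from most to least similar.
    return sorted(heap, key=lambda hit: -hit[0])


def recall_at_k(retrieved_ids, reference_ids):
    """Fraction of the true top-k neighbours that were retrieved."""
    return len(set(retrieved_ids) & set(reference_ids)) / len(reference_ids)


# Hypothetical toy data: 2,000 random 16-dimensional "embeddings".
random.seed(0)
library = [[random.gauss(0, 1) for _ in range(16)] for _ in range(2000)]
# The query is a lightly perturbed copy of library entry 123.
query = [v + random.gauss(0, 0.01) for v in library[123]]

hits = exhaustive_top_k(query, library, k=10)
```

Because memory stays bounded by k regardless of library size, the same scan pattern can be applied shard by shard to a library that does not fit in RAM; only the scoring kernel needs to be fast.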