🤖 AI Summary
This work proposes Pretrained Embedding Distance (PED), a general, tuning-free molecular similarity metric computed as a distance in the embedding space of pretrained molecular models. Traditional similarity measures often rely on handcrafted features or incur high computational costs, while existing deep learning approaches typically require similarity-specific supervision or costly data curation, which limits their generality across targets. In contrast, PED requires neither manual feature engineering nor task-specific training. It performs effectively both in ranking molecules for virtual screening and in guiding goal-directed molecular generation through reward design. Experimental results show that PED exhibits distinct correlations with conventional similarity metrics, suggesting that pretrained embeddings capture rich structural information and can serve as a scalable similarity measure.
📝 Abstract
Molecular similarity plays a central role in ligand-based drug discovery tasks such as virtual screening, analog searching, and goal-directed molecular generation. However, traditional similarity measures, ranging from fingerprint-based Tanimoto coefficients to 3D shape overlays, are often computationally expensive at scale or rely on hand-crafted molecular descriptors. Meanwhile, many deep learning approaches to similarity-aware design still depend on similarity-specific supervision or costly data curation, limiting their generality across targets. In this work, we propose pretrained embedding distance (PED) as an effective alternative, computed directly from pretrained molecular models without task-specific training. Experimental results show that PED exhibits distinct correlations with traditional similarity metrics and performs effectively both in ranking molecules for virtual screening and in guiding molecular generation via reward design. These findings suggest that pretrained molecular embeddings capture rich structural information and can serve as a promising and scalable similarity measure for modern AI-aided drug discovery.
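To make the idea of an embedding-space distance concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the embeddings have already been produced by some pretrained molecular model and are available as NumPy arrays. The function names `embedding_distance` and `rank_by_distance`, the toy vectors, and the choice of Euclidean and cosine distance are illustrative assumptions; the abstract does not specify which distance or which model PED uses.

```python
import numpy as np


def embedding_distance(query, library, metric="euclidean"):
    """Distance from one query embedding to each row of a library matrix.

    Hypothetical helper: the embeddings are assumed to come from a
    pretrained molecular model, which is not reproduced here.
    """
    query = np.asarray(query, dtype=float)
    library = np.asarray(library, dtype=float)
    if metric == "euclidean":
        return np.linalg.norm(library - query, axis=1)
    if metric == "cosine":
        # Cosine distance = 1 - cosine similarity of unit-normalized vectors.
        q = query / np.linalg.norm(query)
        lib = library / np.linalg.norm(library, axis=1, keepdims=True)
        return 1.0 - lib @ q
    raise ValueError(f"unknown metric: {metric}")


def rank_by_distance(query, library, metric="euclidean"):
    """Indices of library molecules, closest to the query first."""
    distances = embedding_distance(query, library, metric)
    return np.argsort(distances, kind="stable")


# Toy 3-dimensional embeddings (made-up values, for illustration only).
query_emb = np.array([1.0, 0.0, 0.0])
library_emb = np.array([
    [0.9, 0.1, 0.0],  # near the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # identical to the query
])

order = rank_by_distance(query_emb, library_emb)
print(order.tolist())
```

In a virtual screening setting, a ranking like `order` would place the library molecules most similar to a known active first. For molecular generation, the same distance could be negated or otherwise transformed into a reward term, though the exact reward design is not described in the abstract.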