🤖 AI Summary
Traditional virtual screening methods struggle to model fine-grained binding interactions accurately and are often misled by spurious correlations in training data, leading to inaccurate affinity ranking. To address this, the authors propose BindCLIP, a framework that, for the first time, integrates pose-level generative supervision into CLIP-style contrastive learning. Specifically, a pocket-conditioned diffusion model generates binding poses to provide fine-grained supervisory signals, guiding the embedding space toward genuine interaction features. The approach further incorporates hard-negative sampling and a ligand–ligand anchoring regularizer to enhance generalization. Evaluated on two public benchmarks, BindCLIP substantially outperforms strong baselines and demonstrates superior interaction-aware performance in out-of-distribution virtual screening and FEP+-based ligand-analogue ranking, highlighting its practical potential.
📝 Abstract
Virtual screening aims to efficiently identify active ligands from massive chemical libraries for a given target pocket. Recent CLIP-style models such as DrugCLIP enable scalable virtual screening by embedding pockets and ligands into a shared space. However, our analyses indicate that such representations can be insensitive to fine-grained binding interactions and may rely on shortcut correlations in training data, limiting their ability to rank ligands by true binding compatibility. To address these issues, we propose BindCLIP, a unified contrastive-generative representation learning framework for virtual screening. BindCLIP jointly trains pocket and ligand encoders using CLIP-style contrastive learning together with a pocket-conditioned diffusion objective for binding pose generation, so that pose-level supervision directly shapes the retrieval embedding space toward interaction-relevant features. To further mitigate shortcut reliance, we introduce hard-negative augmentation and a ligand-ligand anchoring regularizer that prevents representation collapse. Experiments on two public benchmarks demonstrate consistent improvements over strong baselines. BindCLIP achieves substantial gains on challenging out-of-distribution virtual screening and improves ligand-analogue ranking on the FEP+ benchmark. Together, these results indicate that integrating generative, pose-level supervision with contrastive learning yields more interaction-aware embeddings and improves generalization in realistic screening settings, bringing virtual screening closer to real-world applicability.
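The joint contrastive-generative training described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the CLIP-style term is a standard symmetric InfoNCE loss over matched pocket/ligand embedding pairs, the generative term is a generic pocket-conditioned denoising objective on ligand pose coordinates, and the `denoiser` network, the simplified noising schedule, and the weight `lambda_gen` are all hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(pocket_emb, ligand_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss: matched pocket/ligand
    pairs sit on the diagonal of the batch similarity matrix."""
    pocket_emb = F.normalize(pocket_emb, dim=-1)
    ligand_emb = F.normalize(ligand_emb, dim=-1)
    logits = pocket_emb @ ligand_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))               # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def diffusion_pose_loss(denoiser, pocket_emb, pose, t, noise):
    """Pocket-conditioned denoising objective: the denoiser predicts the
    noise added to ligand pose coordinates at diffusion step t
    (toy linear noising schedule for illustration only)."""
    noisy_pose = pose + noise * t.view(-1, 1, 1)
    pred_noise = denoiser(noisy_pose, pocket_emb, t)
    return F.mse_loss(pred_noise, noise)

def bindclip_loss(pocket_emb, ligand_emb, denoiser, pose, t, noise,
                  lambda_gen=0.5):
    """Joint objective: contrastive retrieval loss plus a weighted
    generative pose term (lambda_gen is a hypothetical weight)."""
    return (info_nce(pocket_emb, ligand_emb) +
            lambda_gen * diffusion_pose_loss(denoiser, pocket_emb,
                                             pose, t, noise))
```

Because both terms backpropagate through the pocket encoder, the pose-level denoising signal shapes the same embeddings used for contrastive retrieval, which is the mechanism the abstract attributes the interaction-aware gains to.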