🤖 AI Summary
Traditional virtual screening methods struggle to model fine-grained binding interactions accurately and are often misled by spurious correlations in training data, leading to inaccurate affinity ranking. To address this, the authors propose BindCLIP, a framework that, for the first time, integrates pose-level generative supervision into CLIP-style contrastive learning. Specifically, a pocket-conditioned diffusion model generates binding poses to provide fine-grained supervisory signals, guiding the embedding space toward genuine interaction features. The approach further incorporates hard-negative sampling and a ligand–ligand anchoring regularizer to enhance generalization. Evaluated on two public benchmarks, BindCLIP substantially outperforms strong baselines and demonstrates superior interaction-aware performance in out-of-distribution virtual screening and FEP+-based ligand-analogue ranking, highlighting its practical potential.
📝 Abstract
Virtual screening aims to efficiently identify active ligands from massive chemical libraries for a given target pocket. Recent CLIP-style models such as DrugCLIP enable scalable virtual screening by embedding pockets and ligands into a shared space. However, our analyses indicate that such representations can be insensitive to fine-grained binding interactions and may rely on shortcut correlations in training data, limiting their ability to rank ligands by true binding compatibility. To address these issues, we propose BindCLIP, a unified contrastive-generative representation learning framework for virtual screening. BindCLIP jointly trains pocket and ligand encoders using CLIP-style contrastive learning together with a pocket-conditioned diffusion objective for binding pose generation, so that pose-level supervision directly shapes the retrieval embedding space toward interaction-relevant features. To further mitigate shortcut reliance, we introduce hard-negative augmentation and a ligand-ligand anchoring regularizer that prevents representation collapse. Experiments on two public benchmarks demonstrate consistent improvements over strong baselines. BindCLIP achieves substantial gains on challenging out-of-distribution virtual screening and improves ligand-analogue ranking on the FEP+ benchmark. Together, these results indicate that integrating generative, pose-level supervision with contrastive learning yields more interaction-aware embeddings and improves generalization in realistic screening settings, bringing virtual screening closer to real-world applicability.
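The joint contrastive-generative training described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the CLIP-style term is a standard symmetric InfoNCE loss over matched pocket/ligand embedding pairs, the generative term is a generic pocket-conditioned denoising objective on ligand pose coordinates, and the `denoiser` network, the simplified noising schedule, and the weight `lambda_gen` are all hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(pocket_emb, ligand_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss: matched pocket/ligand
    pairs sit on the diagonal of the batch similarity matrix."""
    pocket_emb = F.normalize(pocket_emb, dim=-1)
    ligand_emb = F.normalize(ligand_emb, dim=-1)
    logits = pocket_emb @ ligand_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))               # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def diffusion_pose_loss(denoiser, pocket_emb, pose, t, noise):
    """Pocket-conditioned denoising objective: the denoiser predicts the
    noise added to ligand pose coordinates at diffusion step t
    (toy linear noising schedule for illustration only)."""
    noisy_pose = pose + noise * t.view(-1, 1, 1)
    pred_noise = denoiser(noisy_pose, pocket_emb, t)
    return F.mse_loss(pred_noise, noise)

def bindclip_loss(pocket_emb, ligand_emb, denoiser, pose, t, noise,
                  lambda_gen=0.5):
    """Joint objective: contrastive retrieval loss plus a weighted
    generative pose term (lambda_gen is a hypothetical weight)."""
    return (info_nce(pocket_emb, ligand_emb) +
            lambda_gen * diffusion_pose_loss(denoiser, pocket_emb,
                                             pose, t, noise))
```

Because both terms backpropagate through the pocket encoder, the pose-level denoising signal shapes the same embeddings used for contrastive retrieval, which is the mechanism the abstract attributes the interaction-aware gains to.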