🤖 AI Summary
AI-driven drug design often yields high-risk molecules because models lack experimental prior knowledge: for example, over 60% of molecules proposed on the GuacaMol benchmark are predicted to be mutagenic. To address this, the authors propose the first systematic framework for automatically extracting therapeutic design priors from full-text scientific literature. LLM pipelines identify therapeutic entities in relevant paragraphs and summarize them into concise facts, producing a large-scale, computable dataset of 32.3 million pairs of natural-language facts and structured entity representations (e.g., SMILES strings, RefSeq IDs). LLM, CLIP, and LLaVA architectures trained on this dataset reason jointly about text and design targets, and pretraining on it gives small models strong priors: a 15-million-parameter model outperforms the 2-billion-parameter TxGemma on Therapeutic Data Commons benchmarks and approaches the average performance of 9-billion-parameter models, while markedly reducing toxicity risk when used to constrain GuacaMol molecule generation.
📝 Abstract
AI-driven discovery can greatly reduce design time and enhance the effectiveness of new therapeutics. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60% of proposed molecules had a high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines that discover therapeutic entities in relevant paragraphs and summarize the information into concise fair-use facts. Medex consists of 32.3 million pairs of natural-language facts and appropriate entity representations (i.e., SMILES strings or RefSeq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLaVA architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutic Data Commons (TDC). Medex is highly effective for creating models with strong priors: in supervised prediction problems that use our data for pretraining, our best models with 15M learnable parameters outperform the larger 2B-parameter TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with Medex can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset at [huggingface.co/datasets/medexanon/Medex](https://huggingface.co/datasets/medexanon/Medex), and will provide expanded versions as available literature grows.
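To make the pair format concrete, here is a minimal sketch of how records pairing a natural-language fact with a structured entity representation might be modeled and split by modality. The field names (`fact`, `entity_type`, `entity`) and the example records are illustrative assumptions, not the dataset's published schema:

```python
from dataclasses import dataclass

@dataclass
class FactEntityPair:
    """One hypothetical record: a literature-derived fact plus its entity."""
    fact: str          # concise fair-use fact summarized from a paragraph
    entity_type: str   # "smiles" for small molecules, "refseq" for sequences
    entity: str        # SMILES string or RefSeq accession

def split_by_modality(pairs):
    """Partition pairs into molecule (SMILES) and sequence (RefSeq) subsets."""
    molecules = [p for p in pairs if p.entity_type == "smiles"]
    sequences = [p for p in pairs if p.entity_type == "refseq"]
    return molecules, sequences

# Hypothetical examples in the spirit of the dataset description.
pairs = [
    FactEntityPair("Aspirin irreversibly inhibits COX-1.", "smiles",
                   "CC(=O)OC1=CC=CC=C1C(=O)O"),
    FactEntityPair("TP53 encodes a tumor-suppressor protein.", "refseq",
                   "NM_000546"),
]
molecules, sequences = split_by_modality(pairs)
```

Separating the two modalities like this is the kind of preprocessing a downstream text–molecule or text–sequence model (e.g., a CLIP-style dual encoder) would need before routing each pair to the appropriate entity encoder.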