Informing Acquisition Functions via Foundation Models for Molecular Discovery

📅 2025-12-15
🤖 AI Summary
Bayesian optimization (BO) suffers from limited performance in molecular discovery due to data scarcity and the high-dimensional, combinatorially vast candidate space. Method: We propose a likelihood-free BO framework that bypasses conventional surrogate models and instead directly integrates prior knowledge from large language models (LLMs) and chemistry foundation models to guide the acquisition function. To ensure scalability, we introduce hierarchical tree-based space partitioning and coarse-grained, LLM-driven clustering, augmented by Monte Carlo tree search for accelerated optimization. Contribution/Results: This work establishes an LLM-guided, surrogate-free BO paradigm that overcomes computational and scalability bottlenecks in million-molecule libraries. Experiments on multi-task molecular optimization benchmarks demonstrate substantial improvements in sample efficiency, robustness, and search scalability over existing LLM-augmented BO methods.

📝 Abstract
Bayesian Optimization (BO) is a key methodology for accelerating molecular discovery by estimating the mapping from molecules to their properties while seeking the optimal candidate. Typically, BO iteratively updates a probabilistic surrogate model of this mapping and optimizes acquisition functions derived from the model to guide molecule selection. However, its performance is limited in low-data regimes with insufficient prior knowledge and vast candidate spaces. Large language models (LLMs) and chemistry foundation models offer rich priors to enhance BO, but high-dimensional features, costly in-context learning, and the computational burden of deep Bayesian surrogates hinder their full utilization. To address these challenges, we propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search. By further incorporating coarse-grained LLM-based clustering, it substantially improves scalability to large candidate sets by restricting acquisition function evaluations to clusters with statistically higher property values. We show through extensive experiments and ablations that the proposed method substantially improves scalability, robustness, and sample efficiency in LLM-guided BO for molecular discovery.
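The paper releases no code here, but the core idea of a likelihood-free acquisition function can be sketched in a few lines: instead of fitting a probabilistic surrogate, candidates are ranked by a foundation-model prior score plus a bandit-style exploration bonus. Everything below is a hypothetical stand-in, not the authors' implementation; in particular `llm_prior_score` is a toy deterministic scorer where the paper would query an LLM or chemistry foundation model.

```python
import math


def llm_prior_score(smiles: str) -> float:
    """Hypothetical stand-in for a foundation-model prior.

    In the paper's setting this score would come from a general LLM or a
    chemistry foundation model; here it is a toy deterministic hash of
    the SMILES string, mapped into [0, 1).
    """
    return (sum(ord(ch) for ch in smiles) % 100) / 100.0


def acquisition(smiles: str, visits: dict, total_visits: int, c: float = 1.0) -> float:
    """Likelihood-free acquisition: no surrogate posterior is fit.

    The prior score plays the role of the posterior mean, and a
    UCB-style bonus on visit counts replaces the posterior variance.
    """
    bonus = c * math.sqrt(math.log(total_visits + 1) / (visits.get(smiles, 0) + 1))
    return llm_prior_score(smiles) + bonus


# Toy candidate pool (hypothetical SMILES strings).
candidates = ["CCO", "c1ccccc1", "CC(=O)O"]
visits = {"CCO": 3}  # "CCO" has already been evaluated three times
best = max(candidates, key=lambda s: acquisition(s, visits, total_visits=3))
```

The design choice to note: rarely visited molecules receive a larger exploration bonus, so the search does not simply greedily follow the prior, which mirrors how the paper's acquisition balances foundation-model knowledge against observed data.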
Problem

Research questions and friction points this paper is trying to address.

Enhances Bayesian Optimization for molecular discovery using foundation models
Addresses low-data limitations and high computational costs in optimization
Improves scalability and efficiency in large molecular search spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages foundation models to inform acquisition functions directly
Uses tree-structured search space partitioning with Monte Carlo Tree Search
Incorporates LLM-based clustering to improve scalability and efficiency
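The second and third innovation bullets describe selecting candidates via Monte Carlo Tree Search over a tree-structured partition of clusters. A minimal, self-contained sketch of that selection loop follows, under stated assumptions: the partition, the `oracle` property evaluator, and the cluster contents are all hypothetical toys, and standard UCT replaces the paper's local acquisition functions.

```python
import math
import random


class Node:
    """A node in a tree-structured partition of the candidate pool.

    Leaves hold molecule clusters; internal nodes split the pool.
    """

    def __init__(self, molecules, children=()):
        self.molecules = list(molecules)
        self.children = list(children)
        self.visits = 0
        self.mean = 0.0  # running mean of observed property values


def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT score: unvisited clusters are explored first."""
    if child.visits == 0:
        return float("inf")
    return child.mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select(root: Node) -> list:
    """Descend from the root to a leaf cluster by maximizing UCT."""
    path, node = [root], root
    while node.children:
        node = max(node.children, key=lambda ch: uct(ch, node.visits))
        path.append(node)
    return path


def update(path: list, reward: float) -> None:
    """Backpropagate an observed property value along the visited path."""
    for node in path:
        node.visits += 1
        node.mean += (reward - node.mean) / node.visits


# Toy partition: two leaf clusters under one root (hypothetical SMILES).
left = Node(["CCO", "CCN"])
right = Node(["c1ccccc1", "c1ccncc1"])
root = Node(left.molecules + right.molecules, [left, right])


def oracle(smiles: str) -> float:
    """Stand-in for an expensive molecular property evaluation."""
    return len(smiles) / 10.0


for _ in range(8):
    path = select(root)                       # MCTS descent to a leaf
    mol = random.choice(path[-1].molecules)   # pick within the cluster
    update(path, oracle(mol))                 # backpropagate the reward
```

Restricting evaluation to one cluster per iteration is what makes this scale: acquisition scores are computed only inside the selected leaf rather than over the full million-molecule library, which is the scalability mechanism the bullets above describe.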
👥 Authors
Qi Chen
Department of Computer Science, University of Toronto
Fabio Ramos
University of Sydney and NVIDIA (robotics, machine learning)
Alán Aspuru-Guzik
Department of Computer Science, University of Toronto; Department of Chemical Engineering & Applied Chemistry, University of Toronto; Senior Fellow, Canadian Institute for Advanced Research (CIFAR)
Florian Shkurti
Assistant Professor, Computer Science, University of Toronto (robotics, machine learning, computer vision, artificial intelligence)