Predicting from Strings: Language Model Embeddings for Bayesian Optimization

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 11
Influential: 3
🤖 AI Summary
Existing Bayesian optimization (BO) methods rely on regression models such as Gaussian processes, which assume fixed search spaces and tabular numerical features, making them ill-suited to black-box optimization over string-valued inputs. To address this limitation, the authors propose the *Embed-then-Regress* paradigm, which integrates pretrained language models into BO by mapping string inputs to continuous embeddings and performing in-context regression in the embedding space. Expressing all inputs as strings removes the dependence on structured inputs or predefined search spaces, enabling unified treatment of diverse tasks, including synthetic, combinatorial, and hyperparameter optimization. Empirical evaluation across multiple benchmarks shows performance competitive with state-of-the-art Gaussian process–based BO methods, establishing an end-to-end, cross-domain BO approach for unstructured, string-valued inputs.

📝 Abstract
Bayesian Optimization is ubiquitous in the field of experimental design and blackbox optimization for improving search efficiency, but has been traditionally restricted to regression models which are only applicable to fixed search spaces and tabular input features. We propose Embed-then-Regress, a paradigm for applying in-context regression over string inputs, through the use of string embedding capabilities of pretrained language models. By expressing all inputs as strings, we are able to perform general-purpose regression for Bayesian Optimization over various domains including synthetic, combinatorial, and hyperparameter optimization, obtaining comparable results to state-of-the-art Gaussian Process-based algorithms. Code can be found at https://github.com/google-research/optformer/tree/main/optformer/embed_then_regress.
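A minimal, illustrative sketch of the Embed-then-Regress loop described in the abstract: embed candidate strings into vectors, predict objective values with uncertainty in the embedding space, and select the next point via an upper-confidence-bound acquisition. The hash-based `toy_embed` and the nearest-neighbor surrogate below are stand-ins for the paper's pretrained language-model encoder and in-context regressor; the candidates and objective are invented for illustration only.

```python
import hashlib
import math

def toy_embed(s, dim=8):
    """Hash-based stand-in for a pretrained LM's string embedding."""
    return [int.from_bytes(hashlib.sha256(f"{i}:{s}".encode()).digest()[:4], "big")
            / 2**32 - 0.5 for i in range(dim)]

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def propose(candidates, observed, beta=1.0):
    """Pick the candidate maximizing a UCB-style score in embedding space.

    observed: list of (string, value) pairs already evaluated.
    Prediction = value at the nearest observed embedding;
    uncertainty = distance to it (a crude surrogate for a GP's
    posterior mean and standard deviation).
    """
    emb_obs = [(toy_embed(s), v) for s, v in observed]
    best, best_score = None, -math.inf
    for c in candidates:
        e = toy_embed(c)
        nearest = min(emb_obs, key=lambda ev: dist(e, ev[0]))
        score = nearest[1] + beta * dist(e, nearest[0])
        if score > best_score:
            best, best_score = c, score
    return best

# Toy BO loop: maximize string length (an arbitrary black-box objective).
objective = len
cands = ["a", "bb", "ccc", "dddd", "eeeee"]
history = [("a", objective("a"))]
for _ in range(3):
    nxt = propose([c for c in cands if c not in dict(history)], history)
    history.append((nxt, objective(nxt)))
print(max(history, key=lambda sv: sv[1])[0])  # best string found so far
```

Because every input is just a string, the same loop applies unchanged whether the candidates encode hyperparameter configurations, combinatorial structures, or synthetic function inputs — which is the flexibility the paradigm is after.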
Problem

Research questions and friction points this paper is trying to address.

Traditional BO regression models (e.g., Gaussian processes) require fixed search spaces and tabular input features
String-valued or unstructured inputs fall outside the scope of standard BO surrogates
A single regression model that generalizes across diverse optimization domains has been lacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embed-then-Regress: in-context regression over pretrained LM string embeddings for BO
Represents diverse inputs as strings for flexibility
Achieves performance comparable to Gaussian Process methods