An Exploratory Study of Bayesian Prompt Optimization for Test-Driven Code Generation with Large Language Models

📅 2025-12-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of prompt optimization in test-driven code generation with large language models (LLMs). Methodologically, it introduces the first Bayesian optimization (BO) framework for prompt engineering: an auxiliary LLM maps discrete prompts into a continuous embedding space, enabling construction of a Gaussian process surrogate model; random projection and dimensionality-scaled priors are incorporated to mitigate degradation in high-dimensional modeling. Optimization leverages execution feedback from test cases to adaptively search for prompts maximizing functional correctness. Evaluated on HumanEval+, the approach significantly outperforms fixed and manually tuned prompts across multiple base LLMs, improving code generation accuracy with rapid convergence, often within a few BO iterations. The core contributions are (1) establishing a continuous-embedding prompt optimization paradigm, and (2) enhancing the stability and efficiency of BO in high-dimensional embedding spaces.
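The loop described above can be sketched in a few dozen lines. This is a minimal illustration, not the paper's implementation: `embed` is a hash-seeded stand-in for the auxiliary LLM's embedding, and `pass_rate` is a synthetic stand-in for running the generated code against test cases; all names, dimensions, and hyperparameters are illustrative assumptions.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)
D, d = 256, 16  # embedding dim, projected dim (illustrative sizes)
P = rng.normal(size=(D, d)) / np.sqrt(d)  # random projection matrix

def embed(prompt: str) -> np.ndarray:
    """Stand-in for the auxiliary LLM's prompt embedding (hash-seeded, deterministic)."""
    seed = int(hashlib.sha256(prompt.encode()).hexdigest(), 16) % 2**32
    return np.random.default_rng(seed).normal(size=D)

w = rng.normal(size=d)
def pass_rate(z: np.ndarray) -> float:
    """Synthetic stand-in for the fraction of test cases the generated code passes."""
    return float(1.0 / (1.0 + np.exp(-z @ w)))

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_ucb(X, y, Xq, beta=2.0, noise=1e-6):
    """GP posterior on query points Xq, folded into a UCB acquisition score."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Kq = rbf(Xq, X)
    mu = Kq @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ji->i", Kq, np.linalg.solve(K, Kq.T))
    return mu + beta * np.sqrt(np.maximum(var, 1e-12))

# Candidate prompt pool (illustrative strings), embedded and projected once.
prompts = [f"Solve the task step by step, variant {i}." for i in range(40)]
Z = np.stack([embed(p) for p in prompts]) @ P
Z /= np.linalg.norm(Z, axis=1, keepdims=True)  # unit-normalize for the kernel

# Seed with a few random evaluations, then run BO iterations.
tried = list(rng.choice(len(prompts), size=3, replace=False))
scores = [pass_rate(Z[i]) for i in tried]
for _ in range(10):
    remaining = [i for i in range(len(prompts)) if i not in tried]
    acq = gp_ucb(Z[tried], np.array(scores), Z[remaining])
    nxt = remaining[int(np.argmax(acq))]
    tried.append(nxt)
    scores.append(pass_rate(Z[nxt]))

best = prompts[tried[int(np.argmax(scores))]]
```

In a real setting, `pass_rate` would query the base LLM with the candidate prompt and execute the generated code against the benchmark's test suite; the surrounding surrogate-and-acquisition machinery stays the same.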

📝 Abstract
We consider the task of generating functionally correct code using large language models (LLMs). The correctness of generated code is influenced by the prompt used to query the given base LLM. We formulate the problem of finding an appropriate prompt as a combinatorial search process and propose a Bayesian optimization (BO) approach referred to as *BO for Code GENeration (BODE-GEN)*. BODE-GEN performs an adaptive, data-driven search over prompts guided by training data in the form of prompts tried and the functional accuracy of the generated code over a set of given test cases. The key insight is to perform BO in a continuous embedding space by using an auxiliary LLM to bridge the gap between the discrete prompt space and the continuous embedding space. We leverage two synergistic ideas, namely random projections and dimensionality-scaled priors, to build effective Gaussian-process-based surrogate models over the high-dimensional embedding space. Our experiments on the HumanEval+ benchmark using multiple base LLMs show that BODE-GEN improves code generation accuracy compared to fixed prompts and manual prompt engineering. Additionally, we demonstrate that BODE-GEN is sample-efficient, requiring relatively few BO iterations to improve code accuracy.
Problem

Research questions and friction points this paper is trying to address.

Optimizing prompts for correct code generation with LLMs
Using Bayesian optimization to search prompt space efficiently
Improving code accuracy via adaptive data-driven prompt selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian optimization adaptively searches prompt space
Uses auxiliary LLM to map prompts to embeddings
Employs Gaussian processes with random projections
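The random-projection idea in the last bullet can be checked numerically: a scaled Gaussian projection to a much lower dimension approximately preserves pairwise distances (the Johnson–Lindenstrauss effect), which is what makes a GP surrogate in the reduced space a reasonable proxy for one in the full embedding space. A minimal sketch, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, n = 768, 64, 40  # original dim, projected dim, number of points (illustrative)

X = rng.normal(size=(n, D))               # stand-in prompt embeddings
P = rng.normal(size=(D, d)) / np.sqrt(d)  # scaled Gaussian projection matrix
Z = X @ P

def pdist(M):
    """All pairwise Euclidean distances (upper triangle, as a flat vector)."""
    diff = M[:, None, :] - M[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(M), k=1)
    return dist[iu]

# Ratios of projected to original distances cluster tightly around 1,
# with distortion shrinking as the projected dimension d grows.
ratios = pdist(Z) / pdist(X)
```

With the `1/sqrt(d)` scaling, each squared distance ratio follows roughly a chi-squared distribution with `d` degrees of freedom divided by `d`, so the typical distortion is on the order of `1/sqrt(d)`.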