AI Summary
This work addresses the challenge of efficiently optimizing system prompts when only aggregate metric feedback is available, without access to per-sample labels or error signals. The authors propose ReElicit, a novel framework that leverages large language models to dynamically construct an interpretable semantic embedding space, which is jointly refined through iterative collaboration with Gaussian process-based Bayesian optimization. In this approach, an acquisition function guides prompt generation, while the feature space is adaptively reconstructed based on the optimization trajectory. Evaluated across ten system prompt optimization tasks, ReElicit achieves state-of-the-art overall performance, significantly outperforming existing baselines that rely solely on aggregate feedback, despite using only 30 evaluation queries per task.
Abstract
System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on "embedding by elicitation". Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. Using a Gaussian process surrogate fitted in that space, an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.
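The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the kernel, the acquisition function, or the initialization scheme, so the RBF kernel, expected improvement, five random initial prompts, and random candidate sampling below are all assumptions. The three LLM steps are replaced by hypothetical deterministic stubs (`elicit_features`, `embed_prompt`, `realize_prompt`), and `toy_evaluate` is a made-up stand-in for the aggregate benchmark score.

```python
import math
import numpy as np

FEATURES = ["step_by_step", "conciseness", "role_specificity"]  # hypothetical feature names

def elicit_features(task, history):
    """Stub: a real system would ask an LLM to (re-)propose features from the history."""
    return FEATURES

def realize_prompt(target, features):
    """Stub: a real system would ask an LLM to write a prompt matching the target vector."""
    return "; ".join(f"{name}={v:.2f}" for name, v in zip(features, target))

def embed_prompt(prompt, features):
    """Stub: a real system would ask an LLM to score the prompt on each feature."""
    values = dict(item.split("=") for item in prompt.split("; "))
    return np.array([float(values[name]) for name in features])

def toy_evaluate(prompt):
    """Made-up aggregate metric: one scalar per prompt, no per-example feedback."""
    x = embed_prompt(prompt, FEATURES)
    return float(1.0 - np.sum((x - np.array([0.8, 0.3, 0.6])) ** 2))

def rbf_kernel(A, B, length_scale=0.5):
    """Squared-exponential kernel between the rows of A and B (assumed kernel choice)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_posterior(X, y, Xq, noise=1e-4):
    """GP posterior mean and standard deviation at Xq, with y centered on its mean."""
    y_mean = y.mean()
    L = np.linalg.cholesky(rbf_kernel(X, X) + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - y_mean))
    Ks = rbf_kernel(X, Xq)
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - (v**2).sum(0), 1e-12, None)
    return Ks.T @ alpha + y_mean, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement for maximization (assumed acquisition function)."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def reelicit_sketch(task, evaluate, budget=30, n_init=5, n_candidates=256, seed=0):
    rng = np.random.default_rng(seed)
    features = elicit_features(task, [])
    history = []  # list of (prompt, scalar score) pairs
    for _ in range(n_init):  # random initial design (assumption)
        prompt = realize_prompt(rng.random(len(features)), features)
        history.append((prompt, evaluate(prompt)))
    while len(history) < budget:
        features = elicit_features(task, history)  # re-elicit the feature space
        X = np.array([embed_prompt(p, features) for p, _ in history])
        y = np.array([s for _, s in history])
        candidates = rng.random((n_candidates, len(features)))
        mu, sigma = gp_posterior(X, y, candidates)
        target = candidates[np.argmax(expected_improvement(mu, sigma, y.max()))]
        prompt = realize_prompt(target, features)  # turn the target vector into a prompt
        history.append((prompt, evaluate(prompt)))
    return max(history, key=lambda pair: pair[1]), history

best, history = reelicit_sketch("toy task", toy_evaluate, budget=30)
print(len(history), best)
```

Because the stubs always return the same three features, the re-elicitation step here is a no-op; in the setting the abstract describes, the feature set itself could change between rounds, and every prompt in the history would then be re-embedded before the surrogate is refit.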