POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the exploration-exploitation trade-off in sequential decision-making and black-box optimization, where efficient, uncertainty-aware methods for large language model (LLM) optimization are lacking. The authors propose POETS (Policy Ensembles for Thompson Sampling), a framework that builds on the observation that KL-regularized policies implicitly encode a reward function. Rather than training an uncertainty-aware reward model and then fitting a policy to it, POETS directly trains a policy ensemble whose implicitly encoded rewards are matched to online, bootstrapped data, so the ensemble itself captures epistemic uncertainty. To keep compute and memory manageable, the ensemble members share a pretrained backbone and differ only in independent LoRA branches. Theoretically, the authors prove that POETS implicitly performs KL-regularized Thompson sampling and therefore inherits a cumulative regret bound of $\mathcal{O}(\sqrt{T \gamma_T})$. Empirically, POETS achieves state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design, and it improves reinforcement learning optimization trajectories, proving particularly robust in off-policy settings with experience replay and in small-dataset regimes.
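The shared-backbone ensemble and the sampling step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class `LoRAEnsemble`, the function `thompson_step`, the single linear layer standing in for an LLM backbone, and all sizes are hypothetical, and the bootstrapped training of the branches is omitted. It only shows why the per-member cost is small and how drawing one ensemble member per round plays the role of a Thompson-style posterior sample.

```python
import numpy as np


class LoRAEnsemble:
    """Toy ensemble: one frozen weight matrix W0 shared by all members,
    plus an independent low-rank update B[k] @ A[k] for each member k.
    (Hypothetical stand-in for an LLM backbone with LoRA branches.)"""

    def __init__(self, d_in, d_out, rank, n_members, seed=0):
        rng = np.random.default_rng(seed)
        # Shared, frozen "backbone" weights.
        self.W0 = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
        # Per-member low-rank factors. B starts at zero (the usual LoRA
        # initialisation), so every member initially equals the backbone;
        # diversity would come from training on different bootstrapped data.
        self.A = rng.normal(size=(n_members, rank, d_in)) * 0.01
        self.B = np.zeros((n_members, d_out, rank))

    def logits(self, k, x):
        """Action logits of ensemble member k for input features x."""
        return self.W0 @ x + self.B[k] @ (self.A[k] @ x)

    def n_params(self):
        """Return (shared parameter count, extra parameters per member)."""
        return self.W0.size, self.A[0].size + self.B[0].size


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def thompson_step(ens, x, rng):
    """Pick one ensemble member uniformly at random, then sample an
    action from that member's policy."""
    k = int(rng.integers(len(ens.A)))
    probs = softmax(ens.logits(k, x))
    action = int(rng.choice(len(probs), p=probs))
    return k, action


if __name__ == "__main__":
    ens = LoRAEnsemble(d_in=8, d_out=5, rank=2, n_members=4)
    member, action = thompson_step(ens, np.ones(8), np.random.default_rng(1))
    shared, per_member = ens.n_params()
    print(f"member={member} action={action} "
          f"shared={shared} per_member={per_member}")
```

In this toy setting each extra member adds only `rank * (d_in + d_out)` parameters on top of the shared matrix, which is the reason a low-rank branch design scales to larger ensembles than duplicating the full model would.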

📝 Abstract

Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS ($\textbf{Po}$licy $\textbf{E}$nsembles for $\textbf{T}$hompson $\textbf{S}$ampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient architecture: the ensemble shares a pre-trained backbone while maintaining diversity through independent Low-Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL-regularized Thompson sampling and thus inherits strong cumulative regret bounds of $\mathcal{O}(\sqrt{T \gamma_T})$. Empirically, we demonstrate that POETS achieves state-of-the-art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off-policy settings with experience replay or in small dataset regimes.
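The claim that a KL-regularized policy implicitly encodes a reward can be made concrete with the standard closed-form solution of the KL-regularized objective. The notation here is an assumption and does not come from the abstract: $\beta > 0$ is the regularization strength, $\pi_{\mathrm{ref}}$ is the reference policy, and $Z(x)$ is the normalizing constant. The paper's exact formulation may differ.

```latex
% KL-regularized objective for a context x (assumed notation)
\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\bigl[ r(x, y) \bigr]
  - \beta\, \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)

% Its maximizer is a reward-tilted version of the reference policy
\pi^{*}(y \mid x)
  = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\bigl( r(x, y) / \beta \bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\bigl( r(x, y) / \beta \bigr)

% Solving for r shows the reward is recoverable from the policy
r(x, y)
  = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  + \beta \log Z(x)
```

Since $\beta \log Z(x)$ does not depend on $y$, the log-ratio $\beta \log\bigl(\pi(y \mid x)/\pi_{\mathrm{ref}}(y \mid x)\bigr)$ determines the reward up to a per-context constant. Under this reading, each policy in an ensemble corresponds to one reward hypothesis, which is what makes it possible to fit policies to data directly instead of fitting a separate reward model first.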

Problem

Research questions and friction points this paper is trying to address.

exploration-exploitation trade-off

black-box optimization

uncertainty quantification

Large Language Models

policy optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Ensembles

Thompson Sampling

KL Regularization