Adaptive Simulation Experiment for LLM Policy Optimization

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses how to efficiently identify the best large language model (LLM) execution policy from a finite candidate set, where the policy governs response quality, shapes user experience, and influences operational value. Treating the LLM as a stochastic simulator, the authors propose an adaptive simulation experiment framework based on pairwise comparisons. The framework covers two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, they characterize the fundamental data requirement for identifying the optimal policy with high probability. They derive a closed-form optimal sampling allocation for the unstructured setting and formulate a regularized convex program for the structured case. The proposed procedure, LLM-PO, is proven to identify the optimal policy with the desired statistical guarantee while asymptotically attaining that data requirement, and in numerical experiments it consistently outperforms benchmark methods.

📝 Abstract

Large language models (LLMs) have significant potential to improve operational efficiency in operations management. Deploying these models requires specifying a policy that governs response quality, shapes user experience, and influences operational value. In this research, we treat LLMs as stochastic simulators and propose a pairwise comparison-based adaptive simulation experiment framework for identifying the optimal policy from a finite set of candidates. We consider two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, we characterize the fundamental data requirements for identifying the optimal policy with high probability. In the unstructured case, we derive a closed-form expression for the optimal sampling proportions, together with a clear operational interpretation. In the structured case, we formulate a regularized convex program to compute the optimal proportions. We then develop an adaptive experimental procedure, termed LLM-PO, for both policy spaces, and prove that it identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements. Numerical experiments demonstrate that LLM-PO consistently outperforms benchmark methods and improves LLM performance.
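To make the idea of a pairwise comparison-based adaptive experiment concrete, the following is a minimal sketch, not the authors' LLM-PO procedure. It assumes a Bradley-Terry-Luce preference model with hypothetical latent utilities `theta`, replaces the LLM and its judge with a simulated coin flip (`simulate_comparison`), and uses a simple leader-versus-challenger rule with an upper confidence bound to decide which pair to compare next. All function names, the warm-up length, and the confidence bonus are illustrative assumptions; the abstract does not specify any of them.

```python
import math
import random


def btl_win_prob(theta_i, theta_j):
    """P(policy i is preferred to policy j) under an assumed Bradley-Terry-Luce model."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))


def simulate_comparison(theta, i, j, rng):
    """Hypothetical stand-in for running the LLM under policies i and j and
    asking a judge which response is better. Returns True if i is preferred."""
    return rng.random() < btl_win_prob(theta[i], theta[j])


def identify_best_policy(theta, budget, rng, warmup=10):
    """Leader-versus-challenger sampling heuristic (illustrative, not LLM-PO)."""
    k = len(theta)
    wins = [[0] * k for _ in range(k)]    # wins[i][j]: times i beat j
    counts = [[0] * k for _ in range(k)]  # counts[i][j]: times i met j

    def record(i, j):
        i_preferred = simulate_comparison(theta, i, j, rng)
        winner, loser = (i, j) if i_preferred else (j, i)
        wins[winner][loser] += 1
        counts[i][j] += 1
        counts[j][i] += 1

    def borda(i):
        # Average empirical win rate of policy i against every other policy.
        return sum(wins[i][j] / counts[i][j] for j in range(k) if j != i) / (k - 1)

    used = 0
    # Warm-up: compare every pair a fixed number of times.
    for i in range(k):
        for j in range(i + 1, k):
            for _ in range(warmup):
                record(i, j)
                used += 1

    # Adaptive phase: pit the current leader against the challenger with the
    # highest optimistic estimate of beating it.
    while used < budget:
        leader = max(range(k), key=borda)

        def ucb(j):
            n = counts[leader][j]
            return wins[j][leader] / n + math.sqrt(math.log(used + 1) / (2 * n))

        challenger = max((j for j in range(k) if j != leader), key=ucb)
        record(leader, challenger)
        used += 1

    return max(range(k), key=borda), counts


theta = [0.0, 0.5, 2.0, 0.8]  # hypothetical latent utilities; index 2 is best
best, counts = identify_best_policy(theta, budget=3000, rng=random.Random(0))
print("selected policy:", best)
print("comparisons per pair:", counts)
```

In this sketch the comparison budget concentrates on pairs that involve the current leader, which is the intuition behind adaptive allocation: comparisons that cannot change the decision are not worth repeating. The paper's optimal sampling proportions are derived analytically (closed form in the unstructured case, a regularized convex program in the structured case) and are not reproduced here.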

Problem

Research questions and friction points this paper is trying to address.

LLM policy optimization

adaptive simulation

optimal policy identification

operations management

stochastic simulators

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive simulation

LLM policy optimization

pairwise comparison