🤖 AI Summary
This study addresses how to efficiently identify the optimal large language model (LLM) execution policy from a finite candidate set, where the policy governs response quality, user experience, and operational value. Treating the LLM as a stochastic simulator, the authors propose an adaptive simulation experiment framework based on pairwise comparisons, covering both unstructured and structured policy spaces. They establish the first information-theoretic lower bound for LLM policy optimization, derive a closed-form optimal sampling allocation for the unstructured setting, and formulate a regularized convex program for the structured case. The proposed algorithm, LLM-PO, integrates adaptive experimental design with statistical learning theory. It is proven to identify the optimal policy with the desired statistical guarantee while asymptotically attaining the minimum data requirement, and it outperforms benchmark methods in numerical experiments.
📝 Abstract
Large language models (LLMs) have significant potential to improve efficiency in operations management. Deploying these models requires specifying a policy that governs response quality, shapes user experience, and influences operational value. In this research, we treat LLMs as stochastic simulators and propose a pairwise comparison-based adaptive simulation experiment framework for identifying the optimal policy from a finite set of candidates. We consider two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, we characterize the fundamental data requirements for identifying the optimal policy with high probability. In the unstructured case, we derive a closed-form expression for the optimal sampling proportions, together with a clear operational interpretation. In the structured case, we formulate a regularized convex program to compute the optimal proportions. We then develop an adaptive experimental procedure, termed LLM-PO, for both policy spaces, and prove that it identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements. Numerical experiments demonstrate that LLM-PO consistently outperforms benchmark methods and improves LLM performance.
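To make the pairwise-comparison setting concrete, the following minimal Python sketch simulates duels between candidate policies and picks the one with the highest average win rate. It is not LLM-PO: it uses a fixed uniform allocation across pairs rather than the optimal sampling proportions described in the abstract. The Bradley-Terry preference model, the `strengths` values, and the function names `simulate_duel` and `select_best_policy` are all assumptions introduced here for illustration; the abstract does not name a specific preference model.

```python
import random


def simulate_duel(theta_i, theta_j, rng):
    """Return True if policy i wins a single comparison against policy j.

    Assumption: outcomes follow a Bradley-Terry model, so policy i wins
    with probability theta_i / (theta_i + theta_j).
    """
    return rng.random() < theta_i / (theta_i + theta_j)


def select_best_policy(strengths, budget, seed=0):
    """Spend `budget` duels uniformly over all pairs and return the index
    of the policy with the highest average empirical win rate.

    `strengths` holds hypothetical latent quality scores; in a real
    experiment each duel would be a judged comparison of LLM outputs.
    """
    rng = random.Random(seed)
    k = len(strengths)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    wins = [[0] * k for _ in range(k)]
    counts = [[0] * k for _ in range(k)]

    # Round-robin over pairs: a uniform, non-adaptive allocation.
    for t in range(budget):
        i, j = pairs[t % len(pairs)]
        if simulate_duel(strengths[i], strengths[j], rng):
            wins[i][j] += 1
        else:
            wins[j][i] += 1
        counts[i][j] += 1
        counts[j][i] += 1

    def average_win_rate(i):
        rates = [
            wins[i][j] / counts[i][j]
            for j in range(k)
            if j != i and counts[i][j] > 0
        ]
        return sum(rates) / len(rates) if rates else 0.0

    return max(range(k), key=average_win_rate)
```

For example, `select_best_policy([1.0, 1.2, 5.0], budget=3000)` compares three hypothetical policies with 1000 duels per pair. An adaptive procedure would instead shift samples toward the pairs that are hardest to distinguish, which is where the optimal sampling proportions come in.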