EVOLvE: Evaluating and Optimizing LLMs For Exploration

📅 2024-10-08

🏛️ arXiv.org

📈 Citations: 14

✨ Influential: 1

🤖 AI Summary

This work addresses the limited ability of large language models (LLMs) to explore actively and make optimal decisions under uncertainty. It systematically evaluates LLMs on stateless reinforcement learning, specifically multi-armed bandit (MAB) tasks, using a benchmark that covers both context-free and contextual settings at varying difficulty. To close the gap, it proposes two ways of integrating algorithmic knowledge: (1) *algorithm-guided support*, which gives the LLM explicit help from classical bandit algorithms at inference time; and (2) *algorithm distillation*, which transfers algorithmic behavior through in-context demonstrations and supervised fine-tuning on synthetic data generated by those algorithms. Experiments show that smaller models equipped with these techniques surpass larger models on various MAB tasks. An ablation study examines how task difficulty and data representation affect exploration efficiency, and a regret-based analysis links exploration ability to model size and the underlying algorithm.

📝 Abstract

Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a stateless reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conduct an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM's exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
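To make the bandit setting and the notion of regret concrete, here is a minimal sketch of a classical exploration algorithm (UCB1) on a context-free Bernoulli bandit. It is an illustration only, not the paper's benchmark code: the function name `run_ucb`, the arm means, and the horizon are all assumptions chosen for the example. Regret here is the cumulative gap between the best arm's expected reward and the expected reward of the arm actually pulled.

```python
import math
import random


def run_ucb(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit (hypothetical helper, not from the paper).

    Returns the cumulative expected-regret curve and the pull count per arm.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    best = max(means)     # expected reward of the optimal arm
    regret = 0.0
    curve = []
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull every arm once before using the bonus
        else:
            # empirical mean plus an exploration bonus that shrinks with pulls
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]   # expected regret of this pull
        curve.append(regret)
    return curve, counts


if __name__ == "__main__":
    curve, counts = run_ucb([0.2, 0.5, 0.8], horizon=500)
    print("final cumulative regret:", round(curve[-1], 2))
    print("pulls per arm:", counts)
```

A flat regret curve late in the run indicates the agent has settled on the best arm; a curve that keeps growing linearly indicates a failure to explore or to exploit, which is the kind of behavior a regret analysis is meant to expose.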

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' decision-making under uncertainty

Optimizing LLMs for exploration in bandit scenarios

Enhancing exploration efficiency via algorithmic integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrate algorithmic knowledge into LLMs

Use synthetic data for algorithm distillation

Achieve superior performance with smaller models
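As a rough sketch of how synthetic data for algorithm distillation might be produced, the snippet below rolls out a UCB1 policy and records, at each step, a text rendering of the interaction history together with the arm the algorithm chose. The prompt wording, the `prompt`/`completion` field names, and the helpers `ucb_choice` and `make_examples` are hypothetical; the source does not specify the paper's actual data format, so treat this purely as one possible shape for such demonstrations.

```python
import math
import random


def ucb_choice(counts, sums, t):
    """Pick an arm with UCB1; unpulled arms are tried first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a]
        + math.sqrt(2 * math.log(t) / counts[a]),
    )


def make_examples(means, horizon, seed=0):
    """Roll out UCB1 and emit (history text, chosen arm) pairs.

    The text format is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    history, examples = [], []
    for t in range(1, horizon + 1):
        arm = ucb_choice(counts, sums, t)
        context = "\n".join(history) if history else "(no pulls yet)"
        examples.append(
            {
                "prompt": context + "\nWhich arm do you pull next?",
                "completion": f"arm {arm}",
            }
        )
        # sample a Bernoulli reward and update the statistics
        reward = 1 if rng.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        history.append(f"step {t}: pulled arm {arm}, reward {reward}")
    return examples


if __name__ == "__main__":
    for ex in make_examples([0.3, 0.7], horizon=4)[:3]:
        print(ex["prompt"])
        print("->", ex["completion"])
        print()
```

Pairs like these could serve either as in-context demonstrations or as supervised fine-tuning targets; how the history is rendered (raw steps versus a summary) is exactly the kind of data-representation choice an ablation would vary.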

🔎 Similar Papers

PhaseEvo: Towards Unified In-Context Prompt Optimization for Large Language Models