🤖 AI Summary
This study investigates how well large language models (LLMs) handle the exploration-exploitation trade-off in sequential decision-making tasks such as contextual multi-armed bandits. Methodologically, it combines prompt-driven in-context learning, semantic modeling of the action space, and comparative evaluation against linear regression baselines. The key result is a pronounced asymmetry: LLMs are strong semantics-guided explorers, significantly outperforming random sampling in large, semantically rich action spaces, but weak exploiters, underperforming even a simple linear model. The work is, to the authors' knowledge, the first to empirically identify and characterize this asymmetry. It also proposes an in-context mitigation strategy for small-scale tasks that partially improves exploitation performance. Together, these findings shed light on the decision-making behavior of LLMs and on their controllability and limitations in adaptive sequential decision settings.
📝 Abstract
We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff. We use LLMs to explore and to exploit in isolation across various (contextual) bandit tasks. We find that while current LLMs often struggle to exploit, in-context mitigations can substantially improve performance on small-scale tasks. Even then, however, LLMs perform worse than a simple linear regression. On the other hand, we find that LLMs do help in exploring large action spaces with inherent semantics, by suggesting suitable candidates to explore.
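To make the linear regression baseline concrete, the sketch below simulates a contextual bandit with linear rewards and compares a greedy exploit policy backed by per-arm ridge regression against uniform-random arm selection. This is an illustrative toy setup, not the paper's actual evaluation protocol; the dimensions, horizon, and noise scale are assumptions chosen for a quick demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, horizon = 5, 4, 500

# Hidden linear reward model: reward(arm a, context x) = theta[a] @ x + noise
theta = rng.normal(size=(n_arms, d))

def pull(arm, x):
    """Sample a noisy reward for pulling `arm` under context `x`."""
    return theta[arm] @ x + rng.normal(scale=0.1)

def run(policy):
    """Run one bandit episode; return cumulative reward."""
    # Per-arm ridge-regression statistics: A = X^T X + I, b = X^T y
    A = np.stack([np.eye(d) for _ in range(n_arms)])
    b = np.zeros((n_arms, d))
    total = 0.0
    for t in range(horizon):
        x = rng.normal(size=d)
        if policy == "greedy" and t >= n_arms:
            # Exploit: pick the arm with the highest estimated reward
            est = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
            arm = int(np.argmax(est @ x))
        else:
            # Explore (or warm up the greedy policy) uniformly at random
            arm = int(rng.integers(n_arms))
        r = pull(arm, x)
        A[arm] += np.outer(x, x)
        b[arm] += r * x
        total += r
    return total

greedy_reward = run("greedy")
random_reward = run("random")
print(f"greedy: {greedy_reward:.1f}, random: {random_reward:.1f}")
```

Even this simple ridge-based exploiter accumulates far more reward than random play, which is the kind of baseline the abstract reports LLMs failing to match on the exploitation side.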