Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Tasks

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) exhibit human-like decision-making behavior and performance in the fundamental sequential decision-making problem of exploration–exploitation (E&E) trade-offs. Methodologically, it adopts the standard multi-armed bandit (MAB) paradigm from cognitive science to enable controlled, cross-subject comparison of E&E strategies across LLMs, human participants, and classical algorithms, integrating interpretable choice models (e.g., softmax, UCB variants) and reasoning-augmentation techniques (e.g., Chain-of-Thought, Self-Refine). Results demonstrate that explicit reasoning significantly enhances LLMs' behavioral alignment with humans—particularly in hybrid random-and-directed exploration patterns—and yields near-human exploration dynamics in stationary environments. However, in non-stationary environments, LLMs remain markedly inferior to humans at dynamically adapting directed exploration, revealing a critical limitation in their online learning capability. These findings establish the MAB framework as a rigorous benchmark for evaluating LLM decision-making and uncover fundamental gaps between current LLMs and human adaptive cognition.
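The interpretable choice models mentioned above typically decompose behavior into random exploration (a softmax temperature) and directed exploration (an uncertainty bonus, as in UCB). A minimal sketch of such a model — not the paper's actual code; the parameter names `beta` and `phi` are illustrative conventions from this literature, not taken from the paper:

```python
import numpy as np

def choice_probabilities(means, uncertainties, beta=3.0, phi=1.0):
    """Softmax choice rule with an uncertainty bonus.

    means         : estimated mean reward per arm
    uncertainties : estimation uncertainty (e.g., posterior std) per arm
    beta          : inverse temperature; lower beta -> more random exploration
    phi           : weight on the uncertainty bonus -> directed exploration
    """
    values = np.asarray(means, dtype=float) + phi * np.asarray(uncertainties, dtype=float)
    logits = beta * values
    logits -= logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Arm 0 has a lower mean than arm 1 but much higher uncertainty,
# so the bonus makes it the most likely choice (directed exploration).
probs = choice_probabilities(means=[0.5, 0.6, 0.4], uncertainties=[0.3, 0.05, 0.1])
```

Fitting `beta` and `phi` to each agent's choices is how one can compare how much random versus directed exploration LLMs, humans, and algorithms exhibit.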

📝 Abstract
Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) tasks introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making, and point to potential areas for improvement.
Problem

Research questions and friction points this paper is trying to address.

Compare E&E strategies of LLMs and humans in MAB tasks
Assess impact of reasoning on LLM decision-making adaptability
Evaluate LLMs as human behavior simulators in dynamic tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Use multi-armed bandit tasks for LLM-human comparison
Apply interpretable choice models to analyze strategies
Enhance LLMs with reasoning for human-like exploration
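One of the classical algorithm baselines named in the summary is UCB. As a point of reference for what a purely directed-exploration agent looks like in a stationary bandit, here is a minimal UCB1 sketch; the reward means and exploration constant are illustrative, not values from the paper:

```python
import math
import random

def ucb1(pull, n_arms, horizon, c=2.0):
    """UCB1: pull each arm once, then repeatedly choose the arm maximizing
    empirical mean + sqrt(c * ln(t) / pulls) -- an explicit uncertainty bonus."""
    counts = [0] * n_arms      # number of pulls per arm
    sums = [0.0] * n_arms      # cumulative reward per arm
    for t in range(horizon):
        if t < n_arms:
            arm = t            # initialization: try every arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(c * math.log(t + 1) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts

# Bernoulli bandit with three arms; the best arm (index 2) should
# accumulate the bulk of the pulls as uncertainty bonuses shrink.
random.seed(0)
means = [0.2, 0.5, 0.8]
counts = ucb1(lambda a: 1.0 if random.random() < means[a] else 0.0,
              n_arms=3, horizon=500)
```

Comparing pull-count trajectories like these against human and LLM choice sequences is the kind of cross-subject analysis the MAB paradigm enables.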