🤖 AI Summary
Large language models (LLMs) exhibit insufficient exploration in sequential decision-making, leading to suboptimal performance on multi-armed bandit (MAB) tasks. Method: We systematically investigate how supervised fine-tuning (SFT) on expert trajectories and reinforcement learning (RL) shape the exploration–exploitation trade-off, and we design tailored reward signals: a regret-shaped reward that reduces variance and an algorithmic reward that enables precise imitation of an oracle policy. Contribution/Results: The resulting agents match classical UCB and Thompson Sampling on standard MAB benchmarks, generalize robustly to horizons six times longer, and transfer across distinct bandit families. Behavioral analysis reveals that these gains often stem from more sophisticated but greedier exploitation: trained agents are more prone to premature convergence and early catastrophic failure than pre-trained models, and agents trained to imitate UCB can even outperform their teacher by adopting more exploitative variants. Our work uncovers this dual nature of greediness in LLM-based decision-making and underscores the need for tailored reward design and evaluation metrics beyond average regret to ensure reliable exploration.
📝 Abstract
While Large Language Models (LLMs) hold promise to become autonomous agents, they often explore suboptimally in sequential decision-making. Recent work has sought to enhance this capability via supervised fine-tuning (SFT) or reinforcement learning (RL), improving regret on the classic multi-armed bandit task. However, it remains unclear how these learning methods shape exploration strategies and how well they generalize. We investigate both paradigms by training LLMs with SFT on expert trajectories and RL with a range of tailored reward signals, including a strategic regret-shaped reward to reduce variance and an algorithmic reward that enables oracle imitation. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling, with robust generalization to 6x longer horizons and across bandit families. Behavioral analysis reveals that gains often stem from more sophisticated but greedier exploitation: RL/SFT agents are more prone to early catastrophic failure than pre-trained models, prematurely abandoning exploration. Furthermore, agents trained to imitate UCB learn to outperform their teacher by adopting more exploitative variants. Our findings clarify when each training paradigm is preferable and advocate tailored reward design and evaluation beyond average regret to promote robust exploratory behavior.
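For readers unfamiliar with the baselines, a minimal sketch (not the paper's implementation, and no LLM involved) contrasts UCB1 with a purely greedy agent on a Bernoulli bandit. It illustrates the exploration–exploitation trade-off and the kind of premature convergence the paper attributes to greedier agents: the greedy policy can lock onto a suboptimal arm after a lucky early pull, while UCB's confidence bonus keeps revisiting under-sampled arms. Arm means and horizon are arbitrary choices for illustration.

```python
import math
import random

def run_bandit(means, horizon, rng, use_ucb):
    """Play a Bernoulli bandit for `horizon` steps; return cumulative (pseudo-)regret."""
    k = len(means)
    counts = [0] * k          # pulls per arm
    values = [0.0] * k        # running mean reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # initialize: pull each arm once
        elif use_ucb:
            # UCB1: empirical mean plus an exploration bonus for rarely pulled arms
            arm = max(range(k),
                      key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        else:
            # Greedy: exploit the current empirical best, never explore deliberately
            arm = max(range(k), key=lambda a: values[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update
        regret += best - means[arm]
    return regret

rng = random.Random(0)
means = [0.3, 0.5, 0.7]
n_runs, horizon = 200, 500
ucb_regret = sum(run_bandit(means, horizon, rng, use_ucb=True) for _ in range(n_runs)) / n_runs
greedy_regret = sum(run_bandit(means, horizon, rng, use_ucb=False) for _ in range(n_runs)) / n_runs
```

Averaged over many runs, the greedy agent typically incurs markedly higher regret than UCB1 on this instance, since a single unlucky initial pull of the best arm can sideline it for the whole horizon.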