Bandit Social Learning: Exploration under Myopic Behavior

📅 2023-02-15
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This paper investigates collective learning failures among myopic agents—agents that follow greedy, exploration-free policies—within the multi-armed bandit (MAB) framework for social learning. In a sequential decision-making setting with no private signals, where each agent relies solely on the shared history of past actions and rewards, the authors study a wide range of myopic behaviors consistent with parameterized confidence intervals, covering the "unbiased" behavior as well as various behavioral biases. Extreme versions of these behaviors recover well-known bandit algorithms, but moderate versions are shown to suffer stark exploration failures, incurring regret that grows linearly in the number of agents. Matching upper bounds, obtained by analyzing "moderately optimistic" agents, characterize the boundary between exploration failure and success. As a special case of independent interest, the paper proves a general failure result for the greedy algorithm in multi-armed bandits—reportedly the first such result in the literature—providing a rigorous foundation for designing distributed learning protocols with provably effective intentional exploration.
📝 Abstract
We study social learning dynamics where the agents collectively follow a simple multi-armed bandit protocol. Agents arrive sequentially, choose arms and receive associated rewards. Each agent observes the full history (arms and rewards) of the previous agents, and there are no private signals. While collectively the agents face an exploration-exploitation tradeoff, each agent acts myopically, without regard to exploration. Motivating scenarios concern reviews and ratings on online platforms. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals, including the "unbiased" behavior as well as various behavioral biases. While extreme versions of these behaviors correspond to well-known bandit algorithms, we prove that more moderate versions lead to stark exploration failures, and consequently to regret rates that are linear in the number of agents. We provide matching upper bounds on regret by analyzing "moderately optimistic" agents. As a special case of independent interest, we obtain a general result on failure of the greedy algorithm in multi-armed bandits. This is the first such result in the literature, to the best of our knowledge.
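The protocol in the abstract—sequential agents sharing one public history, each acting myopically according to a parameterized confidence interval—can be sketched as a small simulation. This is an illustrative toy, not the paper's exact model: the index rule below (`empirical mean + c * sqrt(2*log(t)/pulls)`) and the parameter `c` are our assumptions, chosen so that `c = 0` gives the greedy ("unbiased") agent, `c > 0` gives optimism (UCB-like behavior), and `c < 0` gives pessimism.

```python
import math
import random


def choose_arm(c, counts, sums, t):
    """Myopic index rule: empirical mean + c * sqrt(2*log(t)/pulls).

    c parameterizes the behavior: c = 0 is the greedy ("unbiased") agent,
    c > 0 is optimistic (UCB-like), c < 0 is pessimistic. Unseen arms get
    an infinite index so each arm is tried at least once.
    """
    indices = []
    for n, s in zip(counts, sums):
        if n == 0:
            indices.append(float("inf"))
        else:
            indices.append(s / n + c * math.sqrt(2.0 * math.log(t) / n))
    return max(range(len(indices)), key=indices.__getitem__)


def run_protocol(c, n_agents=2000, means=(0.6, 0.4), seed=0):
    """Agents arrive sequentially and share one public history (counts and
    reward sums per arm); each acts myopically via choose_arm.
    Returns cumulative regret against the best arm."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    regret = 0.0
    best = max(means)
    for t in range(1, n_agents + 1):
        arm = choose_arm(c, counts, sums, t)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret
```

Comparing `run_protocol(0.0)` against `run_protocol(1.0)` averaged over many seeds illustrates the paper's dichotomy: greedy agents can lock onto a suboptimal arm forever after an unlucky start (linear regret), while moderately optimistic agents keep revisiting under-sampled arms.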
Problem

Research questions and friction points this paper is trying to address.

Analyzing learning failures in myopic bandit algorithms
Studying exploration under behavioral biases in social learning
Providing a theoretical foundation for intentional-exploration algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-armed bandit protocol with myopic agents
Parameterized confidence intervals for reward estimation
Optimism-pessimism behavioral model analysis