🤖 AI Summary
This paper studies incentive mechanism design in repeated principal-agent multi-armed bandit games, where the agent must autonomously explore an unknown environment rather than being fully informed and greedy. Addressing the realistic setting where agents are self-interested and exploration is inherently uncertain, the authors propose the first robust elimination framework tailored to exploration-aware learning agents. The method integrates adaptive search, robust incentive design, and a coupled Bayesian/frequentist estimation scheme with explicit exploration. Without agent exploration, it achieves regret bounds of $\widetilde{O}(\sqrt{T})$ under i.i.d. rewards and $\widetilde{O}(T^{2/3})$ under linear rewards; with an exploratory agent, it achieves $\widetilde{O}(T^{2/3})$ under i.i.d. rewards, and when the agent behavior reduces to that studied by Dogan et al. (2023a), it attains a $\widetilde{O}(\sqrt{T})$ bound, significantly improving on their $\widetilde{O}(T^{11/12})$. This provides the first theoretically optimal solution for principals to dynamically guide learning agents in exploration-exploitation settings.
📝 Abstract
We study the repeated principal-agent bandit game, where the principal indirectly interacts with the unknown environment by proposing incentives for the agent to play arms. Most existing work assumes the agent has full knowledge of the reward means and always behaves greedily, but in many online marketplaces, the agent needs to learn the unknown environment and sometimes explore. Motivated by such settings, we model a self-interested learning agent with exploration behaviors who iteratively updates reward estimates and either selects an arm that maximizes the estimated reward plus incentive or explores arbitrarily with a certain probability. As a warm-up, we first consider a self-interested learning agent without exploration. We propose algorithms for both i.i.d. and linear reward settings with bandit feedback in a finite horizon $T$, achieving regret bounds of $\widetilde{O}(\sqrt{T})$ and $\widetilde{O}(T^{2/3})$, respectively. Specifically, these algorithms are established upon a novel elimination framework coupled with newly developed search algorithms which accommodate the uncertainty arising from the learning behavior of the agent. We then extend the framework to handle the exploratory learning agent and develop an algorithm achieving a $\widetilde{O}(T^{2/3})$ regret bound in the i.i.d. reward setup by enhancing the robustness of our elimination framework to the potential agent exploration. Finally, when reducing our agent behaviors to the one studied in Dogan et al. (2023a), we propose an algorithm based on our robust framework that achieves a $\widetilde{O}(\sqrt{T})$ regret bound, significantly improving upon their $\widetilde{O}(T^{11/12})$ bound.
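The agent model in the abstract can be made concrete with a minimal sketch. This is an illustrative epsilon-greedy-style decision rule, not the paper's algorithm: the function names (`agent_step`, `update_estimate`) and the running-mean estimator are assumptions introduced here; the abstract only specifies that the agent maximizes estimated reward plus incentive or explores arbitrarily with some probability.

```python
import random

def agent_step(est_rewards, incentives, epsilon, rng=random):
    """One round of a hypothetical exploratory learning agent:
    with probability epsilon, explore an arbitrary arm; otherwise,
    greedily pick the arm maximizing estimated reward + incentive."""
    if rng.random() < epsilon:
        return rng.randrange(len(est_rewards))  # arbitrary exploration
    return max(range(len(est_rewards)),
               key=lambda a: est_rewards[a] + incentives[a])

def update_estimate(est_rewards, counts, arm, reward):
    """Running-mean update of the agent's estimate for the played arm."""
    counts[arm] += 1
    est_rewards[arm] += (reward - est_rewards[arm]) / counts[arm]
```

For example, with estimates `[0.2, 0.5]` and incentives `[0.4, 0.0]`, the greedy choice (epsilon = 0) is arm 0, since 0.2 + 0.4 exceeds 0.5: the principal's incentive has steered the agent away from its empirically best arm, which is exactly the lever the principal exploits.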