On the optimal regret of collaborative personalized linear bandits

📅 2025-06-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the personalized linear contextual bandit problem in multi-agent collaborative settings, where agents face heterogeneous unknown parameters and naive independent learning ignores inter-agent similarities. To characterize the trade-off between collaboration gains and heterogeneity, the paper establishes the first information-theoretic lower bound and proposes a two-stage hierarchical Bayesian algorithm: first clustering similar agents, then jointly estimating shared structural parameters. The framework yields a three-regime optimal regret bound, $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-\gamma}\sqrt{n})$, and $\tilde{O}(dm\sqrt{n})$, substantially improving over the independent-learning baseline $O(dm\sqrt{n})$. Key contributions are: (1) the first information-theoretic lower bound for heterogeneous multi-agent linear bandits; (2) a hierarchical collaborative learning mechanism adaptive to the degree of heterogeneity; and (3) a regret bound that continuously interpolates between "full sharing" and "full independence."

📝 Abstract
Stochastic linear bandits are a fundamental model for sequential decision making, where an agent selects a vector-valued action and receives a noisy reward whose expected value is given by an unknown linear function. Although well studied in the single-agent setting, many real-world scenarios involve multiple agents solving heterogeneous bandit problems, each with a different unknown parameter. Applying single-agent algorithms independently ignores cross-agent similarity and learning opportunities. This paper investigates the optimal regret achievable in collaborative personalized linear bandits. We provide an information-theoretic lower bound that characterizes how the number of agents, the interaction rounds, and the degree of heterogeneity jointly affect regret. We then propose a new two-stage collaborative algorithm that achieves the optimal regret. Our analysis models heterogeneity via a hierarchical Bayesian framework and introduces a novel information-theoretic technique for bounding regret. Our results offer a complete characterization of when and how collaboration helps, with an optimal regret bound of $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-\gamma}\sqrt{n})$, and $\tilde{O}(dm\sqrt{n})$ for the number of rounds $n$ in the ranges $(0, \frac{d}{m \sigma^2})$, $[\frac{d}{m^{2\gamma} \sigma^2}, \frac{d}{\sigma^2}]$, and $(\frac{d}{\sigma^2}, \infty)$, respectively, where $\sigma$ measures the level of heterogeneity, $m$ is the number of agents, and $\gamma \in [0, 1/2]$ is an absolute constant. In contrast, agents without collaboration achieve a regret bound of $O(dm\sqrt{n})$ at best.
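The three-regime bound from the abstract can be restated compactly as a piecewise expression (boundary conventions follow the abstract):

```latex
\[
\mathrm{Regret}(n) =
\begin{cases}
\tilde{O}\!\left(d\sqrt{mn}\right), & n \in \left(0, \tfrac{d}{m\sigma^2}\right),\\[4pt]
\tilde{O}\!\left(dm^{1-\gamma}\sqrt{n}\right), & n \in \left[\tfrac{d}{m^{2\gamma}\sigma^2}, \tfrac{d}{\sigma^2}\right],\\[4pt]
\tilde{O}\!\left(dm\sqrt{n}\right), & n \in \left(\tfrac{d}{\sigma^2}, \infty\right),
\end{cases}
\qquad \gamma \in [0, 1/2].
\]
```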
Problem

Research questions and friction points this paper is trying to address.

Optimal regret in collaborative personalized linear bandits
Impact of agent count, rounds, heterogeneity on regret
Two-stage algorithm achieving optimal regret bounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage collaborative algorithm for optimal regret
Hierarchical Bayesian framework models heterogeneity
Novel information-theoretic regret bounding technique
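The two-stage idea (cluster similar agents, then jointly estimate shared structure) can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's algorithm: the exploration phase, ridge regularizer `lam`, and clustering threshold `tau` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n_explore = 5, 6, 200

# Two groups of agents; each agent's parameter sits near a shared center,
# with heterogeneity level sigma controlling the within-group spread.
sigma = 0.05
centers = rng.normal(size=(2, d))
true_theta = np.array([centers[a // 3] + sigma * rng.normal(size=d)
                       for a in range(m)])

def ridge_estimate(X, y, lam=1.0):
    # Standard ridge regression estimate (lam chosen for illustration).
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Stage 1: each agent explores independently and estimates its own parameter.
X = rng.normal(size=(m, n_explore, d))
y = np.einsum('aij,aj->ai', X, true_theta) + 0.1 * rng.normal(size=(m, n_explore))
est = np.array([ridge_estimate(X[a], y[a]) for a in range(m)])

def cluster(est, tau=0.5):
    # Greedy single-pass clustering by estimate distance (a crude stand-in
    # for the paper's clustering step; tau is an assumed threshold).
    labels, k = -np.ones(len(est), dtype=int), 0
    for a in range(len(est)):
        if labels[a] >= 0:
            continue
        labels[a] = k
        for b in range(a + 1, len(est)):
            if labels[b] < 0 and np.linalg.norm(est[a] - est[b]) < tau:
                labels[b] = k
        k += 1
    return labels

labels = cluster(est)

# Stage 2: pool all data within a cluster to jointly estimate the shared
# structural parameter, which each member can then refine around.
shared = {c: ridge_estimate(np.vstack(X[labels == c]),
                            np.concatenate(y[labels == c]))
          for c in np.unique(labels)}
```

With low within-group spread relative to the gap between centers, stage 1 recovers the two groups and stage 2's pooled estimate uses roughly `m/2` times more data per parameter than any single agent, which is the source of the collaboration gain the bounds quantify.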
Bruce Huang
University of California, Los Angeles
Ruida Zhou
Amazon AGI
information theory · reinforcement learning · generalization
Lin F. Yang
University of California, Los Angeles
S. Diggavi
University of California, Los Angeles