🤖 AI Summary
This paper studies the personalized linear contextual bandit problem in multi-agent collaborative settings, where agents face heterogeneous unknown parameters and naive independent learning ignores inter-agent similarities. To characterize the trade-off between collaboration gains and heterogeneity, the paper establishes the first information-theoretic lower bound and proposes a two-stage hierarchical Bayesian algorithm: first clustering similar agents, then jointly estimating shared structural parameters. The framework yields a three-regime optimal regret bound, $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-\gamma}\sqrt{n})$, and $\tilde{O}(dm\sqrt{n})$, substantially improving over the independent-learning baseline $O(dm\sqrt{n})$ in the first two regimes. Key contributions are: (1) the first information-theoretic lower bound for heterogeneous multi-agent linear bandits; (2) a hierarchical collaborative learning mechanism adaptive to the degree of heterogeneity; and (3) a regret bound that continuously interpolates between “full sharing” and “full independence.”
📝 Abstract
Stochastic linear bandits are a fundamental model for sequential decision making, in which an agent selects a vector-valued action and receives a noisy reward whose expected value is an unknown linear function of the action. Although the single-agent setting is well studied, many real-world scenarios involve multiple agents solving heterogeneous bandit problems, each with a different unknown parameter. Applying single-agent algorithms independently ignores cross-agent similarity and the learning opportunities it offers. This paper investigates the optimal regret achievable in collaborative personalized linear bandits. We provide an information-theoretic lower bound that characterizes how the number of agents, the number of interaction rounds, and the degree of heterogeneity jointly affect regret. We then propose a new two-stage collaborative algorithm that achieves the optimal regret. Our analysis models heterogeneity via a hierarchical Bayesian framework and introduces a novel information-theoretic technique for bounding regret. Our results offer a complete characterization of when and how collaboration helps, with an optimal regret bound of $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-\gamma}\sqrt{n})$, and $\tilde{O}(dm\sqrt{n})$ for the number of rounds $n$ in the ranges $(0, \frac{d}{m \sigma^2})$, $[\frac{d}{m^{2\gamma} \sigma^2}, \frac{d}{\sigma^2}]$, and $(\frac{d}{\sigma^2}, \infty)$, respectively, where $\sigma$ measures the level of heterogeneity, $m$ is the number of agents, and $\gamma \in [0, 1/2]$ is an absolute constant. In contrast, agents without collaboration achieve a regret bound of $O(dm\sqrt{n})$ at best.
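The two-stage idea (cluster similar agents, then estimate a shared parameter jointly) can be illustrated with a minimal estimation sketch under the paper's hierarchical model, where each agent's parameter is a shared vector plus $\sigma$-scale perturbation. All names, the grouping threshold, and the ridge estimator below are illustrative assumptions, not the paper's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, sigma = 5, 12, 200, 0.05  # dimension, agents, rounds per agent, heterogeneity

# Hierarchical model: theta_i = theta_shared + sigma * (agent-specific noise).
theta_shared = rng.normal(size=d)
theta = theta_shared + sigma * rng.normal(size=(m, d))

# Each agent observes n (context, reward) pairs with unit-variance reward noise.
X = rng.normal(size=(m, n, d))
y = np.einsum("mnd,md->mn", X, theta) + rng.normal(size=(m, n))

def ridge(Xa, ya, lam=1.0):
    """Ridge regression estimate of a linear parameter from (Xa, ya)."""
    return np.linalg.solve(Xa.T @ Xa + lam * np.eye(d), Xa.T @ ya)

# Stage 1: independent per-agent estimates (the non-collaborative baseline).
theta_hat = np.stack([ridge(X[i], y[i]) for i in range(m)])

# Stage 2: group agents whose estimates are close (here one group, since all
# agents share a single mean), then pool their data for a joint estimate.
center = theta_hat.mean(axis=0)
close = np.linalg.norm(theta_hat - center, axis=1) < 3 * sigma + 0.5  # heuristic threshold
theta_joint = ridge(X[close].reshape(-1, d), y[close].reshape(-1))

err_indep = np.linalg.norm(theta_hat - theta, axis=1).mean()
err_joint = np.linalg.norm(theta_joint - theta_shared)
print(err_indep, err_joint)
```

When heterogeneity $\sigma$ is small relative to per-agent estimation noise, the pooled estimate of the shared parameter is typically much tighter than any single agent's estimate, which is the regime where the paper's bound $\tilde{O}(d\sqrt{mn})$ beats the independent baseline $O(dm\sqrt{n})$.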