🤖 AI Summary
This paper addresses the multi-objective linear contextual bandit problem, where multiple potentially conflicting objectives must be optimized simultaneously. To this end, the authors propose MOL-TS, the first Thompson sampling-based algorithm achieving Pareto optimality, introducing the notion of an "effective Pareto front" to enable joint parameter sampling across objectives and eliminate redundant arm selections. MOL-TS integrates high-dimensional linear feature modeling, online learning, and dynamic estimation of the Pareto-optimal solution set to ensure efficient arm selection. The paper establishes a Pareto regret bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, the first rigorous regret guarantee for any Thompson sampling method in the multi-objective linear bandit setting. Empirical evaluations demonstrate that MOL-TS significantly outperforms existing baselines in both Pareto regret control and balanced multi-objective performance.
📝 Abstract
We study the multi-objective linear contextual bandit problem, where multiple possibly conflicting objectives must be optimized simultaneously. We propose \texttt{MOL-TS}, the \textit{first} Thompson Sampling algorithm with Pareto regret guarantees for this problem. Unlike standard approaches that compute an empirical Pareto front each round, \texttt{MOL-TS} samples parameters across objectives and efficiently selects an arm from a novel \emph{effective Pareto front}, which accounts for repeated selections over time. Our analysis shows that \texttt{MOL-TS} achieves a worst-case Pareto regret bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature vectors and $T$ is the total number of rounds, matching the best known order for randomized single-objective linear bandit algorithms. Empirical results confirm the benefits of our proposed approach, demonstrating improved regret minimization and strong multi-objective performance.
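For intuition, the per-round step the abstract describes, sampling one parameter vector per objective and then choosing an arm whose sampled reward vector is Pareto-optimal, can be sketched as below. This is a simplified illustration, not the paper's algorithm: the class and variable names, the Gaussian posterior, and the uniform tie-breaking on the front are assumptions, and the sketch uses a plain Pareto front rather than MOL-TS's effective Pareto front that accounts for repeated selections.

```python
import numpy as np

def pareto_front(rewards):
    """Indices of rows of `rewards` not dominated by any other row."""
    n = len(rewards)
    front = []
    for i in range(n):
        dominated = any(
            np.all(rewards[j] >= rewards[i]) and np.any(rewards[j] > rewards[i])
            for j in range(n) if j != i
        )
        if not dominated:
            front.append(i)
    return front

class MultiObjectiveLinTS:
    """Sketch: Thompson sampling with one Bayesian linear model per objective.

    All objectives share the design matrix B; each objective k keeps its own
    response vector f[k], giving posterior mean B^{-1} f[k].
    """
    def __init__(self, d, n_objectives, v=1.0):
        self.d, self.m, self.v = d, n_objectives, v
        self.B = np.eye(d)                    # regularized design matrix
        self.f = np.zeros((n_objectives, d))  # per-objective response sums

    def select_arm(self, X, rng):
        """X: (n_arms, d) feature matrix; returns an arm on the sampled front."""
        B_inv = np.linalg.inv(self.B)
        mu = self.f @ B_inv.T                 # (m, d) posterior means
        sampled = np.array([
            rng.multivariate_normal(mu[k], self.v ** 2 * B_inv)
            for k in range(self.m)
        ])                                    # one sampled parameter per objective
        rewards = X @ sampled.T               # (n_arms, m) sampled reward vectors
        front = pareto_front(rewards)
        return front[rng.integers(len(front))]  # break ties uniformly (assumption)

    def update(self, x, r):
        """x: (d,) chosen arm's features; r: (m,) observed reward vector."""
        self.B += np.outer(x, x)
        self.f += np.outer(r, x)
```

Selecting from the full front, rather than scalarizing the objectives with fixed weights, is what lets a Thompson-style method spread exploration across the whole Pareto set instead of collapsing to a single trade-off.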