Thompson Sampling for Multi-Objective Linear Contextual Bandit

📅 2025-11-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the multi-objective linear contextual bandit problem, where multiple potentially conflicting objectives must be simultaneously optimized. To this end, we propose MOL-TS—the first Thompson sampling-based algorithm achieving Pareto optimality—introducing the notion of an “effective Pareto frontier” to enable joint parameter sampling across objectives and eliminate redundant arm selections. MOL-TS integrates high-dimensional linear feature modeling, online learning, and dynamic estimation of the Pareto-optimal solution set to ensure efficient arm selection. We establish a Pareto regret bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, the first rigorous regret guarantee for any Thompson sampling method in the multi-objective linear bandit setting. Empirical evaluations demonstrate that MOL-TS significantly outperforms existing baselines in both Pareto regret control and balanced multi-objective performance.

📝 Abstract
We study the multi-objective linear contextual bandit problem, where multiple possibly conflicting objectives must be optimized simultaneously. We propose \texttt{MOL-TS}, the \textit{first} Thompson Sampling algorithm with Pareto regret guarantees for this problem. Unlike standard approaches that compute an empirical Pareto front each round, \texttt{MOL-TS} samples parameters across objectives and efficiently selects an arm from a novel \emph{effective Pareto front}, which accounts for repeated selections over time. Our analysis shows that \texttt{MOL-TS} achieves a worst-case Pareto regret bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature vectors and $T$ is the total number of rounds, matching the best known order for randomized linear bandit algorithms in the single-objective setting. Empirical results confirm the benefits of our proposed approach, demonstrating improved regret minimization and strong multi-objective performance.
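The arm-selection loop the abstract describes — draw one posterior sample per objective, then choose among the arms that are Pareto-optimal under the sampled rewards — can be sketched as follows. This is a minimal illustration of multi-objective linear Thompson Sampling with Gaussian posteriors, not the authors' MOL-TS: the effective Pareto front refinement from the paper is not reproduced, and the names `pareto_front`, `select_arm`, and the variance scale `v` are illustrative assumptions.

```python
import numpy as np

def pareto_front(rewards):
    """Indices of arms whose reward vectors are not dominated by any other arm."""
    n = len(rewards)
    front = []
    for i in range(n):
        dominated = any(
            np.all(rewards[j] >= rewards[i]) and np.any(rewards[j] > rewards[i])
            for j in range(n) if j != i
        )
        if not dominated:
            front.append(i)
    return front

rng = np.random.default_rng(0)
d, m = 5, 2            # feature dimension, number of objectives
B = np.eye(d)          # regularized Gram matrix (updated online in practice)
mu = np.zeros((m, d))  # per-objective posterior means (updated online)

def select_arm(contexts, v=1.0):
    """One round: sample a parameter per objective, then pick an arm
    uniformly from the Pareto front of the sampled reward vectors."""
    cov = v**2 * np.linalg.inv(B)
    # Joint parameter sampling: one posterior draw per objective
    thetas = np.stack([rng.multivariate_normal(mu[k], cov) for k in range(m)])
    sampled = contexts @ thetas.T      # (n_arms, m) matrix of sampled rewards
    front = pareto_front(sampled)
    return front[rng.integers(len(front))]

contexts = rng.standard_normal((10, d))
arm = select_arm(contexts)
```

After observing the reward vector for the chosen arm, `B` and `mu` would be updated with the standard ridge-regression recursions, exactly as in single-objective linear Thompson Sampling.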
Problem

Research questions and friction points this paper is trying to address.

Optimizes conflicting objectives in the multi-objective linear contextual bandit setting
Proposes Thompson Sampling algorithm with Pareto regret guarantees
Achieves efficient regret bound matching single-objective best known order
Innovation

Methods, ideas, or system contributions that make the work stand out.

Thompson Sampling for multi-objective linear contextual bandit
Efficient arm selection from novel effective Pareto front
Achieves Pareto regret bound matching single-objective best order
Somangchan Park
Seoul National University
Heesang Ann
Seoul National University
Min-hwan Oh
Seoul National University
Reinforcement Learning, Bandit Algorithms, Machine Learning