Constrained Linear Thompson Sampling

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the safe linear multi-armed bandit problem: at each round, an action is selected from a convex feasible set to maximize cumulative reward while satisfying unknown linear constraints—both the objective function and constraint coefficients are initially unknown. We propose COLTS, a sampling-based framework that jointly estimates the reward vector and constraint matrix via coupled noise perturbations, enabling principled trade-offs among exploration, exploitation, and safety. Our key contributions include (i) the first coupled noise design and adaptive scaling analysis technique, unifying treatment of strictly safe, softly safe, and prior-free constraint settings; and (ii) achieving the optimal regret bound of Õ(√(d³T)) with zero constraint violations or controllable risk. Compared to optimism-based approaches, COLTS significantly reduces computational complexity while preserving theoretical optimality and practical scalability.

Technology Category

Application Category

📝 Abstract
We study the safe linear bandit problem, where an agent sequentially selects actions from a convex domain to maximize an unknown objective while ensuring unknown linear constraints are satisfied on a per-round basis. Existing approaches primarily rely on optimism-based methods with frequentist confidence bounds, often leading to computationally expensive action selection routines. We propose COnstrained Linear Thompson Sampling (COLTS), a sampling-based framework that efficiently balances regret minimization and constraint satisfaction by selecting actions on the basis of noisy perturbations of the estimates of the unknown objective vector and constraint matrix. We introduce three variants of COLTS, distinguished by the learner's available side information: - S-COLTS assumes access to a known safe action and ensures strict constraint enforcement by combining the COLTS approach with a rescaling towards the safe action. For $d$-dimensional actions, this yields $ ilde{O}(sqrt{d^3 T})$ regret and zero constraint violations (or risk). - E-COLTS enforces constraints softly under Slater's condition, and attains regret and risk of $ ilde{O}(sqrt{d^3 T})$ by combining COLTS with uniform exploration. - R-COLTS requires no side information, and ensures instance-independent regret and risk of $ ilde{O}(sqrt{d^3 T})$ by leveraging repeated resampling. A key technical innovation is a coupled noise design, which maintains optimism while preserving computational efficiency, which is combined with a scaling based analysis technique to address the variation of the per-round feasible region induced by sampled constraint matrices. Our methods match the regret bounds of prior approaches, while significantly reducing computational costs compared to them, thus yielding a scalable and practical approach for constrained bandit linear optimization.
Problem

Research questions and friction points this paper is trying to address.

Maximizes unknown objective while satisfying linear constraints.
Reduces computational costs in safe linear bandit problems.
Introduces efficient sampling-based framework for constraint satisfaction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sampling-based framework for safe linear bandits
Coupled noise design for computational efficiency
Scaling analysis for per-round feasible regions
🔎 Similar Papers
No similar papers found.