Multi-Agent Stage-wise Conservative Linear Bandits

📅 2025-10-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the stochastic linear bandit problem in multi-agent networks under stage-wise safety constraints, which require that the expected reward at every round be at least $(1-\alpha)$ times that of a baseline policy. Each agent observes only local rewards governed by its own unknown parameter, while the network optimizes for the global parameter (the average of the local ones); agents communicate exclusively with their neighbors, and communication incurs additional regret. We propose MA-SCLUCB, a distributed algorithm integrating linear UCB, stage-wise safe action selection, local consensus, and parameter averaging. We establish a high-probability cumulative regret bound of $\tilde{O}\big(d/\sqrt{N} \cdot \sqrt{T}\,\log(NT)/\sqrt{\log(1/|\lambda_2|)}\big)$, improving upon the single-agent rate by a factor of $1/\sqrt{N}$; the communication cost scales logarithmically with the network's spectral gap. To our knowledge, this is the first distributed linear bandit framework achieving a tight three-way trade-off among safety-aware exploration, collaborative learning gain, and low communication overhead.
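The stage-wise safe action selection described above can be sketched as follows. This is a hedged toy illustration, not the paper's exact procedure: the confidence radius `beta`, the candidate action set, and the assumption that the baseline policy's expected reward is known are all placeholders chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 4, 0.1                      # dimension, conservatism level (hypothetical values)
theta_true = rng.normal(size=d)        # unknown parameter (simulated here)
baseline = np.ones(d) / np.sqrt(d)     # baseline policy's fixed action
actions = rng.normal(size=(20, d))     # candidate action set

V = np.eye(d)                          # ridge-regression Gram matrix (lambda = 1)
b = np.zeros(d)
beta = 2.0                             # confidence radius (placeholder; set by the analysis)

def select_action(V, b):
    theta_hat = np.linalg.solve(V, b)  # regularized least-squares estimate
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    ucb = actions @ theta_hat + beta * widths          # optimistic reward estimates
    x = actions[np.argmax(ucb)]                        # UCB-optimal candidate
    # Stage-wise safety check: the pessimistic reward of the UCB candidate
    # must clear (1 - alpha) times the baseline reward (assumed known here);
    # otherwise fall back to the baseline action for this round.
    safe_floor = (1 - alpha) * (baseline @ theta_true)
    x_lcb = x @ theta_hat - beta * np.sqrt(x @ V_inv @ x)
    return x if x_lcb >= safe_floor else baseline

for t in range(50):
    x = select_action(V, b)
    r = x @ theta_true + 0.1 * rng.normal()            # noisy linear reward
    V += np.outer(x, x)
    b += r * x
```

The key design point is that safety is enforced with the *pessimistic* estimate: an action is played only when even its lower confidence bound clears the conservative floor, so the per-round constraint holds with high probability.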

📝 Abstract
In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of $N$ agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than $(1-\alpha)$ times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ with high probability, where $d$ is the dimension, $T$ is the horizon, and $|\lambda_2|$ is the network's second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields a $\frac{1}{\sqrt{N}}$ improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret. Thus, distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks.
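The role of $|\lambda_2|$ in the bound can be seen in a small consensus-averaging experiment. In the sketch below (a toy ring network, not the paper's setup), each gossip round multiplies the agents' local estimates by a doubly stochastic matrix $W$; the deviation from the network average contracts at rate $|\lambda_2(W)|$ per round, which is why roughly $\log(1/\epsilon)/\log(1/|\lambda_2|)$ communication rounds suffice for $\epsilon$-accurate averaging.

```python
import numpy as np

# Toy ring network of N agents with a lazy-random-walk gossip matrix.
N = 8
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25

eigvals = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
lam2 = eigvals[1]                      # second largest eigenvalue magnitude

rng = np.random.default_rng(1)
theta_local = rng.normal(size=(N, 3))  # each agent's local estimate (d = 3)
target = theta_local.mean(axis=0)      # global average all agents should reach

for _ in range(60):                    # gossip rounds: contract toward the average
    theta_local = W @ theta_local

err = np.max(np.abs(theta_local - target))  # shrinks roughly like lam2 ** 60
```

Because $W$ is doubly stochastic, gossip preserves the network average exactly while the disagreement decays geometrically; a well-connected network has small $|\lambda_2|$ and therefore needs only logarithmically many rounds.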
Problem

Research questions and friction points this paper is trying to address.

Multi-agent systems balance exploration with safety constraints
Networked agents optimize global rewards using local observations
Algorithm ensures stage-wise performance above baseline policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent algorithm with stage-wise conservative constraints
Episodic method alternating action and consensus phases
Distributed learning with logarithmic communication overhead growth
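The episodic alternation named above can be outlined as a skeleton. This is a hedged sketch only: the episode length, number of gossip rounds, fully connected averaging matrix, and the stand-in for the safe UCB choice are all placeholders, not the paper's schedule.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, T = 4, 3, 200
W = np.full((N, N), 1.0 / N)           # toy gossip matrix (fully connected averaging)
theta_true = rng.normal(size=(N, d))   # local parameters; the target is their mean

V = np.stack([np.eye(d)] * N)          # per-agent Gram matrices
b = np.zeros((N, d))

t = 0
while t < T:
    # --- action phase: each agent acts and updates its local statistics ---
    for _ in range(10):                # placeholder episode length
        for i in range(N):
            x = rng.normal(size=d)     # stand-in for the safe UCB choice
            x /= np.linalg.norm(x)
            r = x @ theta_true[i] + 0.1 * rng.normal()
            V[i] += np.outer(x, x)
            b[i] += r * x
        t += 1
    # --- consensus phase: average sufficient statistics with neighbors ---
    for _ in range(3):                 # placeholder number of gossip rounds
        V = np.einsum("ij,jkl->ikl", W, V)
        b = W @ b
```

Averaging the sufficient statistics $(V_i, b_i)$ rather than raw rewards is what lets every agent estimate the global parameter while only ever talking to its neighbors.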