Competing Bandits in Decentralized Large Contextual Matching Markets

Date: 2024-11-18
Venue: arXiv.org
Citations: 0
Influential citations: 0

AI Summary
This paper studies sequential learning in decentralized two-sided matching markets in which agents face resource constraints and evolving preferences. In each round, demand-side agents compete for a limited set of supply-side arms (providers) whose preference rankings may change over time, and each arm's mean reward is modeled as a linear function of a known feature (context) vector and an unknown agent-specific parameter. For this non-stationary, competitive, context-dependent setting, the paper brings the linear contextual bandit framework to decentralized matching and proposes algorithms with instance-dependent logarithmic regret. Each agent's regret is $O(\log T)$ and scales independently of the number of arms $K$, whereas existing algorithms such as Explore-Then-Commit and Upper-Confidence-Bound (UCB) incur per-agent regret that grows linearly with $K$. Experiments demonstrate significant improvements over baselines including Explore-Then-Commit and UCB. The core contribution is the joint treatment of dynamic stable matching and linear contextual learning, which yields a regret bound for decentralized matching that does not depend on $K$.

๐Ÿ“ Abstract
Sequential learning in multi-agent, resource-constrained matching markets has received significant interest in the past few years. We study decentralized learning in two-sided matching markets where the demand side (aka players or agents) competes for a 'large' supply side (aka arms) with potentially time-varying preferences, to obtain a stable match. Despite a long line of recent work, existing learning algorithms such as Explore-Then-Commit or Upper-Confidence-Bound remain inefficient for this problem. In particular, the per-agent regret achieved by these algorithms scales linearly with the number of arms, $K$. Motivated by the linear contextual bandit framework, we assume that for each agent an arm-mean can be represented by a linear function of a known feature vector and an unknown (agent-specific) parameter. Moreover, our setup captures the essence of a dynamic (non-stationary) matching market where the preferences over arms change over time. Our proposed algorithms achieve instance-dependent logarithmic regret, scaling independently of the number of arms, $K$.
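
The linear assumption in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: it only shows how an agent that estimates a $d$-dimensional parameter by regularized least squares can rank all $K$ arms after observing far fewer than $K$ pulls. The dimensions, noise level, `theta_true`, and all function names are hypothetical choices made for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ridge_estimate(features, rewards, reg=1.0):
    """Regularized least squares: solve (X^T X + reg * I) theta = X^T y."""
    d = len(features[0])
    gram = [[reg * (i == j) for j in range(d)] for i in range(d)]
    xty = [0.0] * d
    for x, y in zip(features, rewards):
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                gram[i][j] += x[i] * x[j]
    return solve_linear(gram, xty)

def rank_arms(arm_features, theta_hat):
    """Arm indices sorted from most to least preferred under theta_hat."""
    return sorted(range(len(arm_features)),
                  key=lambda k: -dot(arm_features[k], theta_hat))

rng = random.Random(0)
d, K, n_obs = 3, 500, 200            # few features, many arms (assumed sizes)
theta_true = [1.0, -0.5, 0.25]       # hypothetical agent-specific parameter
arm_features = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]

# The agent observes noisy rewards for n_obs pulls, fewer than K arms.
pulled = [rng.randrange(K) for _ in range(n_obs)]
X = [arm_features[k] for k in pulled]
y = [dot(x, theta_true) + rng.gauss(0, 0.1) for x in X]

theta_hat = ridge_estimate(X, y)
ranking = rank_arms(arm_features, theta_hat)   # covers arms never pulled
estimation_error = max(abs(a - b) for a, b in zip(theta_hat, theta_true))
```

The design point is that the learning burden depends on the feature dimension `d` rather than on `K`: one estimate of the parameter induces a preference ranking over every arm, including arms the agent has never tried.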
Problem

Research questions and friction points this paper is trying to address.

Decentralized learning in two-sided matching markets
Time-varying preferences for stable matching
Non-stationary latent environment identification
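
To make the notion of a stable match concrete, the sketch below uses the classical centralized deferred-acceptance procedure and a blocking-pair check. This is a textbook baseline, not the decentralized learning algorithm proposed in the paper, and it assumes every preference list is already known. The agent and arm names and the preference lists are hypothetical.

```python
def deferred_acceptance(agent_prefs, arm_prefs):
    """Agent-proposing deferred acceptance.

    agent_prefs[a] lists arms from most to least preferred;
    arm_prefs[k] lists agents from most to least preferred.
    Returns a dict mapping each matched agent to its arm.
    """
    rank = {k: {a: r for r, a in enumerate(prefs)}
            for k, prefs in arm_prefs.items()}
    next_choice = {a: 0 for a in agent_prefs}
    holder = {}                      # arm -> agent it currently holds
    free = list(agent_prefs)
    while free:
        a = free.pop()
        if next_choice[a] >= len(agent_prefs[a]):
            continue                 # list exhausted: agent stays unmatched
        k = agent_prefs[a][next_choice[a]]
        next_choice[a] += 1
        current = holder.get(k)
        if current is None:
            holder[k] = a
        elif rank[k][a] < rank[k][current]:
            holder[k] = a            # arm k prefers the new proposer
            free.append(current)
        else:
            free.append(a)           # rejected: a proposes again later
    return {a: k for k, a in holder.items()}

def is_stable(matching, agent_prefs, arm_prefs):
    """True if no agent-arm pair would both rather be matched to each other."""
    arm_to_agent = {k: a for a, k in matching.items()}
    for a, prefs in agent_prefs.items():
        for k in prefs:
            if matching.get(a) == k:
                break                # remaining arms rank below a's match
            rival = arm_to_agent.get(k)
            if rival is None or arm_prefs[k].index(a) < arm_prefs[k].index(rival):
                return False         # (a, k) is a blocking pair
    return True

# Two agents competing for three arms (more arms than agents).
agent_prefs = {"a1": ["k1", "k2", "k3"], "a2": ["k1", "k3", "k2"]}
arm_prefs = {"k1": ["a2", "a1"], "k2": ["a1", "a2"], "k3": ["a1", "a2"]}

matching = deferred_acceptance(agent_prefs, arm_prefs)
stable = is_stable(matching, agent_prefs, arm_prefs)
```

Here both agents prefer `k1`, which prefers `a2`, so `a1` settles for `k2`. When preferences change over time, the stable match can change with them, which is why the learning problem has to track it.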
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decentralized learning in matching markets
Linear contextual bandit framework
Instance-dependent logarithmic regret algorithm
Authors

Satush Parikh
Department of Electrical Engineering, IIT Bombay

Soumya Basu
Google, New York

Avishek Ghosh
Department of Computer Science and Engineering, IIT Bombay

Abishek Sankararaman
ML Scientist, AWS
Online Algorithms, Statistical Learning, Stochastic Networks