A Reduction Algorithm for Markovian Contextual Linear Bandits

📅 2026-03-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the contextual linear bandit problem where the action set evolves over time according to an exogenous Markov chain, thereby relaxing the conventional i.i.d. context assumption. By constructing a stationary surrogate action set, the non-stationary Markovian contextual problem is reduced to a standard single-context linear bandit. The authors introduce a novel framework that combines delayed updates with a phased learning strategy to control distributional shift, extending the "cheap context" perspective for the first time to Markov-dependent settings. They propose a general reduction framework applicable to uniformly geometrically ergodic chains, enabling online learning of the surrogate mapping even when the transition dynamics are unknown. Under both known and unknown transition distributions, the approach achieves high-probability worst-case regret bounds matching those of the underlying oracle, with only lower-order dependence on the mixing time.
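The delayed-update idea can be illustrated with a toy sketch (a hypothetical construction, not the paper's algorithm): samples from a Markov chain are correlated, so estimates are updated only once per block of length on the order of the mixing time, using the last sample of each block, which is approximately distributed according to the stationary distribution.

```python
import numpy as np

# Illustrative delayed-update scheme (hypothetical toy, not the paper's method):
# consecutive Markov-chain samples are correlated, so we update statistics only
# once per block of length ~ the (assumed known) mixing time.
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # uniformly ergodic 2-state chain
T, tau = 20000, 10                # horizon and block length

state, counts = 0, np.zeros(2)
for t in range(T):
    state = rng.choice(2, p=P[state])
    if (t + 1) % tau == 0:        # delayed update: keep one sample per block
        counts[state] += 1

pi_hat = counts / counts.sum()
print(pi_hat)  # close to the true stationary distribution (2/3, 1/3)
```

Spacing updates by roughly a mixing time trades a constant factor in sample efficiency for near-i.i.d. samples, which is what lets a standard single-context analysis go through with only lower-order dependence on the mixing time.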

πŸ“ Abstract
Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This "contexts are cheap" perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. Motivated by applications with temporally correlated availability, we extend this perspective to Markovian contextual linear bandits, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown transition distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle, with only lower-order dependence on the mixing time.
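One way to picture the stationary surrogate action set is the following minimal sketch (an illustrative choice, not the paper's exact construction): for a given parameter direction, average each state's greedy action feature under the chain's stationary distribution, so the bandit oracle can act on a single fixed action set.

```python
import numpy as np

# Hypothetical toy: contexts follow an exogenous 2-state Markov chain, and each
# state exposes its own action feature set (rows are 2-d action features).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
actions = {0: np.array([[1.0, 0.0], [0.5, 0.5]]),
           1: np.array([[0.0, 1.0], [0.3, 0.7]])}

def stationary_distribution(P, iters=200):
    """Power iteration pi <- pi P; assumes the chain is ergodic."""
    pi = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(iters):
        pi = pi @ P
    return pi

def surrogate_action(theta, pi):
    """Stationary surrogate feature for direction theta: the pi-weighted
    average of each state's greedy action (one illustrative construction)."""
    return sum(pi[s] * actions[s][np.argmax(actions[s] @ theta)]
               for s in actions)

pi = stationary_distribution(P)          # -> approximately [2/3, 1/3]
z = surrogate_action(np.array([1.0, 0.2]), pi)
print(pi.round(3), z.round(3))
```

When the transition matrix is unknown, the stationary distribution (and hence the surrogate mapping) must itself be estimated from the observed chain, which is what the paper's phased algorithm handles online.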
Problem

Research questions and friction points this paper is trying to address.

Markovian contextual bandits
linear bandits
contextual bandits
reduction
nonstationary contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Markovian contextual bandits
reduction algorithm
uniform geometric ergodicity
delayed-update scheme
surrogate action set