🤖 AI Summary
This paper studies nonstationary bandits with action-dependent latent states, where the underlying state evolves according to an unknown linear dynamical system that is directly influenced by the actions, inducing an inherent trade-off between immediate rewards and long-term system dynamics. To address this, we propose an "explore-then-commit" framework: first, excite the system with Rademacher-noise actions to identify the Markov parameters from temporally correlated rewards, with finite-sample identification error bounds; second, formulate long-horizon reward optimization as an indefinite quadratic program over a hypercube, solved efficiently via a bilinear reward model, semidefinite relaxation, and Goemans–Williamson rounding. Theoretically, our method achieves a regret bound of $\tilde{\mathcal{O}}(T^{2/3})$, establishing the first finite-time optimal exploration-exploitation balance for high-dimensional linear dynamical systems in nonstationary bandit settings.
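The exploration phase described above can be illustrated with a minimal numpy sketch: excite a system with i.i.d. Rademacher (±1) actions and recover its Markov parameters (impulse-response coefficients) by least squares. The scalar reward model, the truncation length `p`, and all numeric values here are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth Markov parameters h_k (a truncated impulse
# response of the unknown linear dynamics); length p is an assumption.
p = 4
h_true = np.array([1.0, 0.5, 0.25, 0.125])

# Exploration phase: excite the system with i.i.d. Rademacher (+/-1) actions.
T_explore = 2000
actions = rng.choice([-1.0, 1.0], size=T_explore)

# Observed rewards are a convolution of past actions with the Markov
# parameters plus noise, so consecutive rewards are temporally correlated.
noise = 0.1 * rng.standard_normal(T_explore)
rewards = np.array([
    h_true @ actions[t - p + 1:t + 1][::-1] for t in range(p - 1, T_explore)
]) + noise[p - 1:]

# Identification: regress rewards on stacked (reversed) action histories.
X = np.array([actions[t - p + 1:t + 1][::-1] for t in range(p - 1, T_explore)])
h_hat, *_ = np.linalg.lstsq(X, rewards, rcond=None)

print(np.round(h_hat, 3))
```

Because Rademacher inputs are persistently exciting, the regression matrix is well conditioned and the estimate concentrates around the true parameters at the usual $1/\sqrt{T}$ rate.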
📝 Abstract
We study a nonstationary bandit problem where rewards depend on both actions and latent states, the latter governed by unknown linear dynamics. Crucially, the state dynamics also depend on the actions, resulting in tension between short-term and long-term rewards. We propose an explore-then-commit algorithm for a finite horizon $T$. During the exploration phase, random Rademacher actions enable estimation of the Markov parameters of the linear dynamics, which characterize the action-reward relationship. In the commit phase, the algorithm uses the estimated parameters to design an optimized action sequence for long-term reward. Our proposed algorithm achieves $\tilde{\mathcal{O}}(T^{2/3})$ regret. Our analysis handles two key challenges: learning from temporally correlated rewards, and designing action sequences with optimal long-term reward. We address the first challenge by providing near-optimal sample complexity and error bounds for system identification using bilinear rewards. We address the second challenge by proving an equivalence with indefinite quadratic optimization over a hypercube, a known NP-hard problem. We provide a sub-optimality guarantee for this problem, enabling our regret upper bound. Lastly, we propose a semidefinite relaxation with Goemans-Williamson rounding as a practical approach.
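The commit-phase optimization, maximizing an indefinite quadratic form over the hypercube $\{-1,1\}^n$, can be sketched as follows. Since no SDP solver is available in the standard scientific stack, a Burer–Monteiro-style projected gradient ascent stands in for an off-the-shelf SDP solve of the relaxation; the matrix `Q`, problem size, and iteration counts are illustrative assumptions, and for small `n` we compare against the brute-force optimum.

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(1)
n = 8

# Hypothetical indefinite reward matrix Q (symmetric, generally not PSD).
Q = rng.standard_normal((n, n))
Q = (Q + Q.T) / 2

# SDP relaxation max <Q, X> s.t. X PSD, diag(X) = 1, via the surrogate
# X = V V^T with unit-norm rows, optimized by projected gradient ascent
# (a stand-in for an off-the-shelf SDP solver).
V = rng.standard_normal((n, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)
for _ in range(500):
    V += 0.01 * (Q @ V)                      # gradient of <Q, V V^T> is 2 Q V
    V /= np.linalg.norm(V, axis=1, keepdims=True)

# Goemans-Williamson rounding: sign of projections onto random hyperplanes.
best = -np.inf
for _ in range(100):
    x = np.sign(V @ rng.standard_normal(n))
    x[x == 0] = 1.0                          # break ties toward +1
    best = max(best, x @ Q @ x)

# Brute-force optimum over {-1, 1}^n for comparison (feasible at n = 8).
opt = max(np.array(v) @ Q @ np.array(v)
          for v in product([-1.0, 1.0], repeat=n))

print(best, opt)
```

Every rounded vector is itself a feasible hypercube point, so the rounded value never exceeds the true optimum; the Goemans–Williamson analysis bounds how far below it the expected rounded value can fall.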