🤖 AI Summary
Existing online learning methods for the Linear Quadratic Regulator (LQR) rely on strong assumptions—particularly global excitability—to achieve theoretical guarantees, severely limiting their applicability.
Method: This paper proposes a novel approximate Thompson sampling algorithm that innovatively integrates preconditioned Langevin dynamics with an adaptive excitation mechanism. Crucially, it operates without assuming system excitability or other restrictive identifiability conditions.
Contribution/Results: The method establishes, for the first time without such assumptions, nontrivial concentration of the approximate posterior distribution. This enables a tight Bayesian regret analysis, yielding an $\tilde{O}(\sqrt{T})$ Bayesian regret upper bound, significantly improving upon prior approaches that require strong assumptions. By unifying perspectives from Bayesian reinforcement learning, stochastic control, and system identification, this work provides a more general and robust theoretical framework and algorithmic paradigm for online LQR learning.
📝 Abstract
We propose a novel Thompson sampling algorithm that learns linear quadratic regulators (LQR) with a Bayesian regret bound of $O(\sqrt{T})$. Our method leverages Langevin dynamics with a carefully designed preconditioner and incorporates a simple excitation mechanism. We show that the excitation signal drives the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process. Furthermore, we establish nontrivial concentration properties of the approximate posteriors generated by our algorithm. These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without relying on the restrictive assumptions that are often used in the literature.
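To make the sampling machinery concrete, the sketch below runs preconditioned (unadjusted) Langevin dynamics to draw approximate posterior samples for a single unknown system parameter. This is an illustrative toy, not the paper's algorithm: the scalar system, the Gram-plus-prior preconditioner `V`, the step size `eta`, and the unit-noise posterior model are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy system (an assumption, not the paper's LQR setting):
# x_{t+1} = a* x_t + u_t + w_t, with known unit input gain; we sample
# the unknown scalar a* from its approximate posterior.
a_true = 0.7
T = 200
x = np.zeros(T + 1)
u = rng.normal(size=T)                  # simple excitation signal
for t in range(T):
    x[t + 1] = a_true * x[t] + u[t] + 0.1 * rng.normal()

z, y = x[:-1], x[1:] - u                # regressors and targets: y ~ a*z + noise
lam = 1.0                               # Gaussian prior precision (assumed)
V = lam + z @ z                         # preconditioner: prior + scalar Gram term

# Preconditioned unadjusted Langevin dynamics targeting the Gaussian
# posterior N((z @ y) / V, 1 / V) (unit observation noise assumed in the model).
a, eta = 0.0, 0.5                       # initialization and step size (assumed)
samples = []
for k in range(500):
    grad = z @ y - V * a                # gradient of the log posterior
    a += (eta / V) * grad + np.sqrt(2 * eta / V) * rng.normal()
    if k >= 200:                        # discard burn-in iterates
        samples.append(a)
a_hat = float(np.mean(samples))         # concentrates near a_true
```

Note how the excitation signal feeds the Gram term `z @ z`, so the preconditioner `V` grows with data; this shrinks the Langevin noise scale `sqrt(2 * eta / V)` and tightens the samples, a scalar analogue of the growing minimum eigenvalue described in the abstract.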