🤖 AI Summary
This paper studies the safe online Linear Quadratic Regulator (LQR) learning problem under state constraints: the system dynamics are unknown, and the state trajectory must remain within a predefined safe region with high probability throughout execution. To overcome the limitation of conventional linear controllers, whose exploration-safety trade-off is inherently constrained, we propose the first general analytical framework for nonlinear constrained control. Our framework reveals that, under process noise with sufficiently large support, safety constraints can enable "free exploration." By designing a nonlinear controller, deriving tight uncertainty-estimation bounds, establishing high-probability safety guarantees, and analyzing regret, we obtain sharp theoretical results: a $\tilde{O}(\sqrt{T})$ regret bound under large-support noise, and a $\tilde{O}(T^{2/3})$ bound under arbitrary sub-Gaussian noise. This work provides the first theoretical foundation for safe online reinforcement learning that simultaneously ensures safety and sample efficiency.
📝 Abstract
Many practical applications of online reinforcement learning require satisfying safety constraints while learning about the unknown environment. In this work, we establish theoretical foundations for reinforcement learning with safety constraints by studying the canonical problem of Linear Quadratic Regulator learning with unknown dynamics, under the additional constraint that the position must stay within a safe region for the entire trajectory with high probability. Our primary contribution is a general framework for studying stronger baselines of nonlinear controllers, which are better suited to constrained problems than linear controllers. Due to the difficulty of analyzing nonlinear controllers in a constrained problem, we focus on 1-dimensional state and action spaces; however, we also discuss how we expect the high-level takeaways to generalize to higher dimensions. Using our framework, we show that for *any* nonlinear baseline satisfying natural assumptions, $\tilde{O}_T(\sqrt{T})$ regret is possible when the noise distribution has sufficiently large support, and $\tilde{O}_T(T^{2/3})$ regret is possible for *any* sub-Gaussian noise distribution. In proving these results, we introduce a new uncertainty-estimation bound for nonlinear controllers which shows that enforcing safety in the presence of sufficient noise can provide "free exploration" that compensates for the added cost of uncertainty in safety-constrained control.
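For concreteness, the scalar constrained-LQR setup the abstract describes can be sketched in standard notation. This is an illustrative formalization, not the paper's exact statement: the symbols $a^{*}, b^{*}$ (unknown dynamics), $q, r > 0$ (cost weights), $\bar{x}$ (safe-region radius), and $\delta$ (failure probability) are assumed placeholders.

```latex
% Scalar (1-D) safe online LQR: the dynamics parameters a*, b* are unknown
% to the learner; the state must stay in the safe interval [-\bar{x}, \bar{x}]
% for the whole horizon with probability at least 1 - \delta.
\begin{aligned}
  & x_{t+1} = a^{*} x_t + b^{*} u_t + w_t,
    \qquad w_t \ \text{sub-Gaussian}, \\
  & \min_{u_0, \dots, u_{T-1}} \;
    \mathbb{E}\Big[ \textstyle\sum_{t=0}^{T-1}
      \big( q\, x_t^2 + r\, u_t^2 \big) \Big] \\
  & \text{s.t.} \quad
    \Pr\big[\, |x_t| \le \bar{x} \ \text{for all } t \le T \,\big]
    \ge 1 - \delta .
\end{aligned}
```

Regret is then measured against the best safe (here, nonlinear) baseline controller that knows $a^{*}$ and $b^{*}$, which is where the $\tilde{O}_T(\sqrt{T})$ and $\tilde{O}_T(T^{2/3})$ rates apply.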