🤖 AI Summary
This paper addresses safe, optimal control of unknown dynamical systems under continuous, non-episodic online learning, a setting where system resets are unavailable and safety must be guaranteed throughout operation.
Method: We propose a non-episodic online learning framework built on a probabilistic dynamics model. It combines pessimistic safety constraints with optimistic exploration, enabling efficient learning while ensuring safety with high probability. Crucially, the method learns the system dynamics to arbitrary accuracy (up to noise) in finite time and, when maximizing reward, focuses only on the aspects of the dynamics needed for near-optimal performance.
Contributions/Results: Theoretical analysis guarantees safety at all times with high probability, together with a bounded and controllable learning error. Experiments on safety-critical benchmarks, including autonomous car racing and drone navigation under aerodynamic disturbances, demonstrate rapid convergence under safety constraints and superior closed-loop control performance.
📝 Abstract
Ensuring both optimality and safety is critical for the real-world deployment of agents, but becomes particularly challenging when the system dynamics are unknown. To address this problem, we introduce a notion of maximum safe dynamics learning via sufficient exploration in the space of safe policies. We propose a *pessimistically* safe framework that *optimistically* explores informative states and, even when model uncertainty prevents reaching them, ensures continuous online learning of the dynamics. The framework achieves first-of-its-kind results: learning the dynamics model sufficiently, up to an arbitrarily small tolerance (subject to noise), in finite time, while ensuring provably safe operation throughout with high probability and without requiring resets. Building on this, we propose an algorithm that maximizes reward while learning the dynamics *only to the extent needed* to achieve close-to-optimal performance. Unlike typical reinforcement learning (RL) methods, our approach operates online in a non-episodic setting and ensures safety throughout the learning process. We demonstrate the effectiveness of our approach in challenging domains such as autonomous car racing and drone navigation under aerodynamic effects, scenarios where safety is critical and accurate modeling is difficult.
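To make the pessimistic-safe / optimistic-exploration interplay concrete, here is a minimal sketch, not the paper's algorithm: a 1-D system x' = x + u + f(x) with unknown f, a toy kernel-regression confidence interval standing in for the probabilistic dynamics model, a pessimistic filter that admits an action only if every dynamics in the confidence set keeps the successor inside the safe set, and an optimistic rule that steers toward the most uncertain (most informative) safe successor. All function names, the surrogate model, and the constants are illustrative assumptions.

```python
import numpy as np

def conf_interval(xs, fs, x, length=0.5):
    """Toy kernel-regression stand-in for a probabilistic dynamics model:
    returns (mean, half-width) of a confidence interval on f(x)."""
    if not xs:
        return 0.0, 0.6                          # no data yet: wide prior
    w = np.exp(-((np.asarray(xs) - x) ** 2) / (2.0 * length ** 2))
    mu = float(w @ np.asarray(fs)) / (float(w.sum()) + 1e-9)
    hw = max(0.1, 0.6 / (1.0 + float(w.sum())))  # shrinks with data; crude noise floor
    return mu, hw

def pessimistically_safe(x, u, xs, fs, x_max=1.0):
    """Admit u only if every plausible dynamics keeps the next state
    x' = x + u + f(x) inside the safe set |x'| <= x_max (worst case)."""
    mu, hw = conf_interval(xs, fs, x)
    return abs(x + u + mu - hw) <= x_max and abs(x + u + mu + hw) <= x_max

def explore_step(x, xs, fs, true_f):
    """One non-episodic step: filter candidate actions pessimistically,
    then steer toward the most uncertain (informative) safe successor."""
    candidates = np.linspace(-0.3, 0.3, 13)
    safe = [u for u in candidates if pessimistically_safe(x, u, xs, fs)]
    if not safe:
        return x                                 # stay put (assumed safe here)
    u = max(safe, key=lambda a: conf_interval(xs, fs, x + a)[1])
    fx = true_f(x)                               # in practice: fx = x_next - x - u
    xs.append(x)
    fs.append(fx)                                # update the model's dataset
    return x + u + fx
```

Running `explore_step` in a loop keeps every visited state inside the safe set while the interval half-width shrinks wherever data accumulates; the paper's framework replaces these toy ingredients with calibrated probabilistic models and formal guarantees.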