🤖 AI Summary
This paper addresses the online linear quadratic regulator (LQR) control problem under unknown system dynamics. To overcome the poor generalization and high computational complexity that conventional model-based reinforcement learning approaches, such as those relying on optimism or Thompson sampling, exhibit in continuous control settings, we introduce the Confusing Instance (CI) principle and the Minimum Empirical Divergence (MED) framework into LQR control for the first time. Combining system identification, analysis of the LQR policy structure, controller sensitivity, and closed-loop stability theory, we propose the MED-LQ algorithm and establish a sublinear regret bound for it. Empirical evaluation demonstrates that MED-LQ matches or exceeds state-of-the-art methods across multiple benchmark tasks while showing strong potential to scale to large continuous control problems.
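To make the MED idea concrete, here is a minimal sketch of an MED-style sampling rule in a Gaussian multi-armed bandit, the setting where the CI principle originates. The function name, the known-variance assumption, and the exact weighting are illustrative only, not the paper's LQR construction: arms are sampled with probability decaying in the empirical divergence to the confusing instance in which they would be optimal.

```python
import numpy as np

def med_probabilities(means, counts, sigma=1.0):
    """Illustrative MED sampling distribution for a Gaussian bandit.

    Each arm's weight decays exponentially with the empirical KL
    divergence between its estimated mean and the current best arm,
    scaled by its pull count. This is the Confusing Instance idea:
    arms that could still plausibly be optimal keep being explored.
    """
    best = np.max(means)
    # KL divergence between N(mu_a, sigma^2) and N(mu_*, sigma^2)
    div = (best - means) ** 2 / (2.0 * sigma**2)
    weights = np.exp(-counts * div)
    return weights / weights.sum()

# Example: arm 1 looks best; arm 0 is close and under-explored,
# so it retains a non-trivial sampling probability.
means = np.array([0.9, 1.0, 0.2])
counts = np.array([5, 50, 50])
print(med_probabilities(means, counts))
```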
📝 Abstract
We revisit the problem of controlling linear systems with quadratic cost under unknown dynamics using model-based reinforcement learning. Traditional methods like Optimism in the Face of Uncertainty and Thompson Sampling, rooted in multi-armed bandits (MABs), face practical limitations in this setting. In contrast, we propose an alternative based on the Confusing Instance (CI) principle, which underpins regret lower bounds in MABs and discrete Markov Decision Processes (MDPs) and is central to the Minimum Empirical Divergence (MED) family of algorithms, known for their asymptotic optimality in various settings. By leveraging the structure of LQR policies along with sensitivity and stability analysis, we develop MED-LQ. This novel control strategy extends the principles of CI and MED beyond small-scale settings. Our benchmarks on a comprehensive control suite demonstrate that MED-LQ achieves competitive performance in various scenarios while highlighting its potential for broader applications in large-scale MDPs.
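For readers unfamiliar with the control side, the sketch below shows the certainty-equivalent primitive that model-based LQR methods build on: solve the discrete algebraic Riccati equation for estimated dynamics and read off the feedback gain. The helper name and example system are hypothetical; the paper's CI/MED machinery would sit on top of this step, weighing candidate models and their induced controllers by empirical divergence rather than committing to a single estimate.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Certainty-equivalent LQR gain for estimated dynamics (A, B).

    Solves the discrete algebraic Riccati equation for P, then returns
    K such that u_t = -K x_t minimizes the infinite-horizon quadratic
    cost sum_t (x_t' Q x_t + u_t' R u_t).
    """
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K

# Example: a marginally unstable double-integrator-like system,
# stabilized by the resulting state-feedback gain.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K = lqr_gain(A, B, Q, R)
print("gain:", K, "closed-loop eigs:", np.linalg.eigvals(A - B @ K))
```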