🤖 AI Summary
This paper identifies an intrinsic instability in Q-learning within continuous-state environments, one that goes beyond the conventional explanations of bootstrapping bias and function approximation error. Through systematic ablation studies (decoupling target value updates, eliminating approximation error, and employing exact Q-function evaluation), we observe divergence of Q-value iteration even on simplified benchmark tasks. Our results indicate that Q-learning's core learning paradigm, the iterative estimation of a Q-function from policy-dependent target values, can be inherently ill-posed. This is empirical evidence that Q-learning's instability can arise from its methodological foundations rather than from implementation-specific flaws. The finding casts doubt on the theoretical reliability and practical robustness of Q-learning as a general-purpose reinforcement learning algorithm. It also offers a conceptual basis for designing stabilization mechanisms, shifting the focus from engineering heuristics to the ill-posedness of learning from policy-dependent targets.
📝 Abstract
This paper investigates the instability of Q-learning in continuous environments, a challenge frequently encountered by practitioners. Traditionally, this instability is attributed to bootstrapping and regression model errors. Using a representative reinforcement learning benchmark, we systematically examine the effects of bootstrapping and model inaccuracies by incrementally eliminating these potential error sources. Our findings reveal that even in relatively simple benchmarks, the fundamental task of Q-learning (iteratively learning a Q-function from policy-specific target values) can be inherently ill-posed and prone to failure. These insights cast doubt on the reliability of Q-learning as a universal solution for reinforcement learning problems.
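To make "iteratively learning a Q-function from target values" concrete, the following is a minimal sketch of generic fitted Q-iteration, not the paper's actual setup. The feature map, the discrete action grid, the discount factor, and all function names are assumptions chosen for illustration, and the targets use the standard greedy form r + gamma * max Q(s', a'), which may differ from the policy-specific targets studied in the paper.

```python
import numpy as np

def bootstrapped_targets(q_fn, rewards, next_states, actions, gamma):
    """Regression targets y = r + gamma * max_a' Q(s', a') from the current estimate."""
    # Evaluate the current Q estimate for every (next_state, candidate action) pair.
    next_q = np.stack(
        [q_fn(next_states, np.full(len(next_states), a)) for a in actions], axis=1
    )
    return rewards + gamma * next_q.max(axis=1)

def features(states, actions):
    # Hypothetical quadratic feature map for scalar states and actions.
    return np.stack(
        [np.ones_like(states), states, actions,
         states * actions, states**2, actions**2],
        axis=1,
    )

def fitted_q_iteration(states, acts, rewards, next_states, actions,
                       gamma=0.9, n_iters=50):
    """Repeatedly regress Q onto targets that depend on the previous Q estimate."""
    weights = np.zeros(6)
    history = []  # largest absolute weight per iteration, a crude divergence signal
    for _ in range(n_iters):
        q_fn = lambda s, a, w=weights: features(s, a) @ w
        targets = bootstrapped_targets(q_fn, rewards, next_states, actions, gamma)
        # Least-squares fit of the linear model to the freshly computed targets.
        weights, *_ = np.linalg.lstsq(features(states, acts), targets, rcond=None)
        history.append(float(np.abs(weights).max()))
    return weights, history
```

Because each regression target is built from the previous estimate, errors in the estimate feed back into the next fit; tracking a quantity such as `history` is one simple way to see whether the iteration settles or grows without bound on a given dataset.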