AI Summary
This paper identifies an intrinsic instability in Q-learning within continuous-state environments, beyond conventional explanations rooted in bootstrapping bias and function approximation error. Through systematic ablation studies (decoupling target value updates, eliminating approximation error, and employing exact Q-function evaluation) we observe divergence of Q-value iteration even on simplified benchmark tasks. Our results demonstrate that Q-learning's core learning paradigm, the policy-dependent iterative estimation of target values, is fundamentally ill-posed. This is the first empirically grounded evidence that Q-learning's instability arises from its methodological foundations rather than from implementation-specific flaws. The finding challenges the theoretical reliability and practical robustness of Q-learning as a general-purpose reinforcement learning algorithm, and it provides a novel conceptual basis for designing stabilization mechanisms, shifting the focus from engineering heuristics to addressing the inherent ill-posedness of the Bellman optimality operator under policy-dependent targets.
Abstract
This paper investigates the instability of Q-learning in continuous environments, a challenge frequently encountered by practitioners. Traditionally, this instability is attributed to bootstrapping and regression model errors. Using a representative reinforcement learning benchmark, we systematically examine the effects of bootstrapping and model inaccuracies by incrementally eliminating these potential error sources. Our findings reveal that even in relatively simple benchmarks, the fundamental task of Q-learning, iteratively learning a Q-function from policy-specific target values, can be inherently ill-posed and prone to failure. These insights cast doubt on the reliability of Q-learning as a universal solution for reinforcement learning problems.
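To make the scheme under discussion concrete, the following is a minimal sketch (not code from the paper) of the policy-dependent target computation at the heart of Q-learning and fitted Q-iteration: each sweep bootstraps regression targets from the current Q-estimate via the Bellman optimality backup y = r + γ · max_a' Q(s', a'). The toy two-state chain, the action set, and the function names are illustrative assumptions; note that in this tabular setting with exact updates the iteration is a contraction and converges, whereas the paper's concern is how this same target-recomputation scheme behaves in continuous-state settings.

```python
# Hypothetical minimal example; the environment and names are not from the paper.
ACTIONS = (0, 1)
GAMMA = 0.9

def bellman_targets(Q, transitions, gamma=GAMMA):
    """Return (state, action, target) regression pairs for a batch.

    Q maps (state, action) -> value; transitions is a list of
    (state, action, reward, next_state) tuples. Targets are bootstrapped
    from the current Q-estimate, so they move at every iteration.
    """
    targets = []
    for s, a, r, s_next in transitions:
        y = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
        targets.append((s, a, y))
    return targets

# Toy 2-state chain with an exact "regression" step (no approximation
# error): the targets are still recomputed from the evolving estimate,
# which is the iterative scheme the paper isolates.
Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]
for _ in range(50):
    for s, a, y in bellman_targets(Q, batch):
        Q[(s, a)] = y
```

In this discrete toy case Q converges to the unique fixed point of the backup (here Q(0,1) → 1/(1 − γ²) ≈ 5.26); the paper's ablations probe what remains of this guarantee once the state space is continuous and the Q-function must be represented by a learned model.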