🤖 AI Summary
This work addresses the instability of gradient temporal difference (GTD) learning when the feature interaction matrix (FIM) is singular. To resolve this issue, the authors propose a regularized GTD algorithm (R-GTD), which introduces a regularization term into the mean squared projected Bellman error (MSPBE) objective. This approach provides, for the first time within the GTD framework, a robust solution that remains stable even when the FIM is non-invertible, guaranteeing convergence to a unique solution. Theoretical analysis establishes the convergence of R-GTD and derives an explicit error bound. Empirical results further demonstrate the superior stability and effectiveness of the proposed method in scenarios involving singular FIMs.
📝 Abstract
Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular, leading to instability or degraded performance. In this paper, we propose a regularized optimization objective by reformulating the mean-square projected Bellman error (MSPBE) minimization. This formulation naturally yields a regularized GTD algorithm, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular. We establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.
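To see why regularization restores uniqueness when the FIM is singular, consider the following minimal numerical sketch. It is illustrative only, not the paper's actual R-GTD update: `A` stands in for a rank-deficient FIM-like matrix, `lam` is a hypothetical regularization strength, and the ridge-style normal equations `(AᵀA + λI)θ = Aᵀb` mimic how a regularized objective yields a unique solution where the unregularized system `Aθ = b` is ill-posed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a singular 4x4 matrix: column 3 duplicates column 2,
# so the matrix is rank-deficient (like a singular FIM).
A = rng.standard_normal((4, 4))
A[:, 3] = A[:, 2]
b = rng.standard_normal(4)
assert np.linalg.matrix_rank(A) < 4  # unregularized system is ill-posed

lam = 1e-2  # regularization strength (hypothetical choice)

# Regularized normal equations: (A^T A + lam * I) theta = A^T b.
# The left-hand matrix is symmetric positive definite for any lam > 0,
# so the solution is unique even though A itself is singular.
M = A.T @ A + lam * np.eye(4)
theta = np.linalg.solve(M, A.T @ b)

residual = np.linalg.norm(M @ theta - A.T @ b)
```

The same mechanism underlies the abstract's claim: adding a regularization term to the MSPBE objective turns a possibly degenerate linear system into a well-posed one with a single minimizer, at the cost of a bias that the paper's error bound quantifies.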