AI Summary
In deep reinforcement learning, the two classical interpretations of the temporal difference (TD) error are no longer equivalent under nonlinear function approximation. This work presents the first systematic analysis of this inequivalence: when deep neural networks are employed, the numerical values of the two TD error formulations can diverge significantly, substantially affecting TD error-based algorithms, such as those for average-reward settings. Through both theoretical analysis and empirical validation, we show that the standard TD error definition may fail in deep RL contexts, and that the choice of interpretation critically influences algorithmic performance. These findings offer a new perspective for improving deep TD learning methods.
Abstract
The temporal difference (TD) error was first formalized in Sutton (1988), where it was initially characterized as the difference between temporally successive predictions and later, in the same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly nonlinear deep RL architectures can cause the two interpretations to yield increasingly different numerical values. Building on this insight, we then show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that use the TD error to compute other quantities, as in deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.
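As a concrete illustration of how the two readings can produce different numbers, here is a minimal NumPy sketch. It is an assumption-laden toy, not the paper's construction: it assumes the "successive predictions" reading evaluates the next state with weights that have already been updated by one semi-gradient TD(0) step, while the "bootstrapped target" reading uses the same pre-update weights for both terms. The network, weights, and step size are all hand-picked for illustration.

```python
import numpy as np

# Toy nonlinear value network v(s) = w2 . tanh(W1 s), hand-picked weights.
W1 = np.array([[0.5, 0.3],
               [0.2, 0.8]])
w2 = np.array([1.0, -0.5])

def v(s, W1, w2):
    return float(w2 @ np.tanh(W1 @ s))

s, s_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
r, gamma, alpha = 1.0, 0.99, 0.5  # a large step size makes the gap visible

# One semi-gradient TD(0) update on the prediction v(s).
delta = r + gamma * v(s_next, W1, w2) - v(s, W1, w2)
h = np.tanh(W1 @ s)
w2_new = w2 + alpha * delta * h                             # dv/dw2 = tanh(W1 s)
W1_new = W1 + alpha * delta * np.outer(w2 * (1 - h**2), s)  # dv/dW1

# Reading 1: bootstrapped target minus prediction, both terms from the same
# pre-update weights (the standard deep-RL critic-loss reading).
delta_target = r + gamma * v(s_next, W1, w2) - v(s, W1, w2)

# Reading 2: difference between temporally successive predictions, where the
# prediction for s_next is made after the weights have changed.
delta_successive = r + gamma * v(s_next, W1_new, w2_new) - v(s, W1, w2)

print(delta_target, delta_successive)  # the two numbers differ
```

With a nonlinear network, a single weight update generalizes to the next state's prediction, so the two quantities separate; any downstream algorithm that consumes the TD error (e.g., an average-reward estimate) then sees different inputs depending on which reading it adopts.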