Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability in reinforcement learning caused by the susceptibility of temporal difference (TD) errors to noise due to bootstrapping. Within the control-as-inference framework, the authors propose a noise-robust learning rule that models the distribution over optimality binary variables and combines forward and reverse Kullback–Leibler (KL) divergences. A pseudo-quantization mechanism is introduced, leveraging the saturation property of the sigmoid function to automatically suppress gradients from outlier TD errors. To balance the strengths of both KL divergences, the method employs a Jensen–Shannon divergence approximation. Evaluated on standard reinforcement learning benchmarks, the approach achieves stable and efficient learning even under noisy rewards or when conventional heuristics fail, demonstrating significantly improved robustness to noise.
📝 Abstract
In reinforcement learning (RL), temporal difference (TD) errors are widely used to optimize value and policy functions. However, because the TD error is defined via bootstrapping, its computation tends to be noisy and can destabilize learning. Heuristics to improve the accuracy of TD errors, such as target networks and ensemble models, have been introduced. While these heuristics are essential to current deep RL algorithms, they cause side effects such as increased computational cost and reduced learning efficiency. This paper therefore revisits TD learning from the perspective of control as inference, deriving a novel algorithm that learns robustly despite noisy TD errors. First, the distribution model of optimality, a binary random variable, is represented by a sigmoid function. Combined with forward and reverse Kullback-Leibler (KL) divergences, this model yields a robust learning rule: when the sigmoid saturates under a large TD error, likely caused by noise, the gradient vanishes, implicitly excluding that sample from learning. Furthermore, the two divergences exhibit distinct gradient-vanishing characteristics. Building on these analyses, the optimality variable is decomposed into multiple levels to achieve a pseudo-quantization of TD errors, aiming for further noise reduction. Additionally, a Jensen-Shannon divergence-based approach is derived approximately to inherit the characteristics of both divergences. These benefits are verified on RL benchmarks, demonstrating stable learning even when heuristics are insufficient or rewards contain noise.
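The gradient-vanishing mechanism the abstract describes can be sketched numerically. The sketch below assumes the optimality probability is modeled as a sigmoid of the TD error, `sigmoid(delta / beta)`; the scale `beta` and the function names are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def optimality_grad(td_error, beta=1.0):
    """Gradient magnitude of a sigmoid optimality model w.r.t. the TD error.

    Assuming p(O=1 | delta) = sigmoid(delta / beta), the derivative is
    p * (1 - p) / beta, which collapses toward zero once the sigmoid
    saturates, so extreme (likely noisy) TD errors contribute almost
    nothing to the parameter update.
    """
    p = sigmoid(td_error / beta)
    return p * (1.0 - p) / beta

# Moderate TD errors keep a sizeable gradient; outliers are suppressed.
for delta in (0.5, 2.0, 10.0):
    print(f"delta={delta:5.1f}  grad={optimality_grad(delta):.6f}")
```

Because `p * (1 - p)` peaks at `delta = 0` and decays exponentially in `|delta|`, an outlier TD error of 10 yields a gradient several orders of magnitude smaller than a moderate error of 0.5, which is the implicit outlier rejection the abstract refers to.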
Problem

Research questions and friction points this paper is trying to address.

temporal difference error
reinforcement learning
noise robustness
learning stability
value estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

pseudo-quantization
temporal difference error
robust reinforcement learning
Kullback-Leibler divergence
control as inference