Moderate Actor-Critic Methods: Controlling Overestimation Bias via Expectile Loss

📅 2025-04-14

📈 Citations: 0

✨ Influential: 0

career value

185K/year

🤖 AI Summary

In model-free reinforcement learning (MF-RL), Q-functions often suffer from overestimation bias due to the interplay of temporal-difference learning and function approximation, undermining policy stability and performance. To address this, we propose a robust target construction based on the lower expectile of the conditional Q-value distribution—introducing, for the first time, the lower expectile as a principled lower-bound estimator for Q-values, formulated as a convex optimization problem that explicitly mitigates overestimation. Our approach is theoretically guaranteed to converge and seamlessly integrates with mainstream off-policy algorithms such as DDPG and SAC without architectural modifications. Empirical evaluation across multiple benchmark tasks demonstrates that our method significantly reduces Q-value overestimation, enhances training stability, and improves final policy performance. These results validate its general applicability and practical effectiveness across diverse model-free RL frameworks.

Technology Category

Application Category

📝 Abstract

Overestimation is a fundamental characteristic of model-free reinforcement learning (MF-RL), arising from the principles of temporal difference learning and the approximation of the Q-function. To address this challenge, we propose a novel moderate target in the Q-function update, formulated as a convex optimization of an overestimated Q-function and its lower bound. Our primary contribution lies in the efficient estimation of this lower bound through the lower expectile of the Q-value distribution conditioned on a state. Notably, our moderate target integrates seamlessly into state-of-the-art (SOTA) MF-RL algorithms, including Deep Deterministic Policy Gradient (DDPG) and Soft Actor Critic (SAC). Experimental results validate the effectiveness of our moderate target in mitigating overestimation bias in DDPG, SAC, and distributional RL algorithms.

Problem

Research questions and friction points this paper is trying to address.

Address overestimation bias in model-free RL

Propose moderate Q-function update via expectile loss

Integrate solution into SOTA algorithms like DDPG and SAC

Innovation

Methods, ideas, or system contributions that make the work stand out.

Moderate target via convex optimization

Lower bound estimation using expectile

Seamless integration with SOTA algorithms

🔎 Similar Papers

A Sharper Global Convergence Analysis for Average Reward Reinforcement Learning via an Actor-Critic Approach