Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning

📅 2025-06-06

🤖 AI Summary

In online reinforcement learning with continuous action spaces, actor-critic methods typically model the Q-function of the current policy with the Bellman operator and rely on policy updates alone for improvement, which often results in low sample efficiency. This work incorporates the Bellman optimality operator into the actor-critic framework and proposes an annealing mechanism: training initially favors the Bellman optimality operator to accelerate learning, then gradually transitions to the standard Bellman operator to mitigate overestimation bias. The method is combined with TD3 and SAC. Experiments across various locomotion and manipulation tasks show significantly better performance than existing approaches, along with improved robustness to optimality-related hyperparameters.

📝 Abstract

For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality.
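
For reference, the two operators contrasted in the abstract can be written in their standard textbook form. The notation here is an assumption, not taken from the paper: reward r, discount factor γ, transition kernel P, policy π. Entropy terms such as those used by SAC are omitted.

```latex
\begin{align}
(\mathcal{T}^{\pi} Q)(s, a) &= r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} \left[ Q(s', a') \right], \\
(\mathcal{T}^{*} Q)(s, a) &= r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \left[ \max_{a'} Q(s', a') \right].
\end{align}
```

Because the maximum over actions is never smaller than an expectation over actions, the optimality backup is pointwise at least as large as the policy backup. With noisy Q estimates the maximum also tends to select positive errors, which is consistent with the overestimation bias the abstract reports.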

Problem

Research questions and friction points this paper is trying to address.

Improving sample efficiency in continuous action RL

Mitigating overestimation bias in optimal value modeling

Enhancing actor-critic frameworks with Bellman optimality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates Bellman optimality operator in actor-critic

Uses annealing to transition between Bellman operators

Combines with TD3 and SAC for improved performance
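
The annealing idea listed above could be illustrated roughly as follows. This is a minimal sketch under stated assumptions and is not the paper's implementation: the source does not specify how the transition is realized. The linear schedule, the approximation of the maximum by sampled candidate actions, and the names `annealed_weight` and `annealed_target` are all hypothetical.

```python
import numpy as np


def annealed_weight(step, total_steps, start=1.0, end=0.0):
    """Hypothetical linear schedule for the weight on the optimality backup."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def annealed_target(reward, done, q_policy_next, q_candidates_next,
                    weight, gamma=0.99):
    """Blend an approximate optimality backup with a current-policy backup.

    Assumed inputs (illustrative only):
      q_policy_next:     shape (batch,), critic value at the policy's action
      q_candidates_next: shape (batch, n), critic values at sampled actions,
                         used to approximate the max over continuous actions
      weight:            1.0 is the optimality-style backup, 0.0 the policy backup
    """
    q_max = q_candidates_next.max(axis=1)
    q_next = weight * q_max + (1.0 - weight) * q_policy_next
    return reward + gamma * (1.0 - done) * q_next


if __name__ == "__main__":
    reward = np.array([1.0, 1.0])
    done = np.array([0.0, 1.0])
    q_policy_next = np.array([2.0, 2.0])
    q_candidates_next = np.array([[1.0, 4.0], [1.0, 4.0]])
    for step in (0, 50, 100):
        w = annealed_weight(step, total_steps=100)
        print(step, w, annealed_target(reward, done, q_policy_next,
                                       q_candidates_next, w, gamma=0.5))
```

Early in training the weight is near 1 and the target leans on the maximum over candidates. As the weight decays to 0, the target reduces to the usual current-policy backup used by TD3-style critics.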
