Value Improved Actor Critic Algorithms

📅 2024-06-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Actor-Critic (AC) algorithms face a trade-off between greediness and stability during policy improvement: gradient-based policy updates tend to be conservative, while overly greedy updates can destabilize training. To address this, the authors propose the Value-Improved AC (VI-AC) framework, a general extension of AC that applies two separate improvement operators: one to the policy, in the spirit of policy-based methods, and one to the value, in the spirit of value-based methods. They instantiate VI-AC in two practical algorithms, VI-TD3 and VI-DDPG, built on the online off-policy AC algorithms TD3 and DDPG. Evaluated on the MuJoCo benchmark, both variants improve upon or match the performance of their respective baselines in all environments tested, supporting the generality of the dual-improvement design.

📝 Abstract
Many modern reinforcement learning algorithms build on the actor-critic (AC) framework: iterative improvement of a policy (the actor) using policy improvement operators and iterative approximation of the policy's value (the critic). In contrast, the popular value-based algorithm family employs improvement operators in the value update, to iteratively improve the value function directly. In this work, we propose a general extension to the AC framework that employs two separate improvement operators: one applied to the policy in the spirit of policy-based algorithms and one applied to the value in the spirit of value-based algorithms, which we dub Value-Improved AC (VI-AC). We design two practical VI-AC algorithms based on the popular online off-policy AC algorithms TD3 and DDPG. We evaluate VI-TD3 and VI-DDPG on the MuJoCo benchmark and find that both improve upon or match the performance of their respective baselines in all environments tested.
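The dual-improvement idea in the abstract can be made concrete with a toy sketch. One natural way to instantiate a value-improvement operator (an assumption for illustration, not necessarily the paper's construction) is to bootstrap the critic target on the best of a few actions sampled around the actor's output, rather than on the actor's action alone. All function names and parameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_target(q, reward, next_state, policy_action, gamma=0.99):
    # Standard AC critic target: bootstrap only on the actor's own action.
    return reward + gamma * q(next_state, policy_action)

def value_improved_target(q, reward, next_state, policy_action,
                          n_samples=8, noise_std=0.2, gamma=0.99):
    # Sketch of a value-improvement operator: bootstrap on the best Q-value
    # among the actor's action and a few perturbed candidates, greedifying
    # the value update in the spirit of value-based methods.
    candidates = policy_action + noise_std * rng.standard_normal(
        (n_samples,) + np.shape(policy_action))
    best = max(q(next_state, a) for a in candidates)
    best = max(best, q(next_state, policy_action))  # never worse than standard
    return reward + gamma * best
```

Because the actor's own action is always among the candidates, this improved target is never smaller than the standard one, which is the sense in which the operator "improves" the value estimate.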
Problem

Research questions and friction points this paper is trying to address.

Balancing greedification and stability in Actor-Critic algorithms
Improving policy updates with value-improvement operators
Enhancing performance in continuous control environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends the Actor-Critic framework with a value-improvement operator
Balances greedification and stability in policy updates
Improves performance in continuous control environments
Yaniv Oren
PhD candidate, Delft University of Technology
Reinforcement Learning
Moritz A. Zanger
Delft University of Technology, 2628 CD Delft, The Netherlands
Pascal R. van der Vaart
Delft University of Technology, 2628 CD Delft, The Netherlands
M. Spaan
Delft University of Technology, 2628 CD Delft, The Netherlands
Wendelin Böhmer
Sequential Decision Making Group, Delft University of Technology
artificial intelligence · machine learning · reinforcement learning · multi-agent systems