🤖 AI Summary
Actor-critic (AC) algorithms face a trade-off between greediness and stability in policy improvement: gradient-based policy updates tend to be conservative, while aggressively greedy updates can destabilize training. The proposed Value-Improved AC (VI-AC) framework addresses this by employing two separate improvement operators: one applied to the policy (the actor), in the spirit of policy-based methods, and one applied to the value function (the critic), in the spirit of value-based methods. The framework is instantiated in two practical algorithms, VI-TD3 and VI-DDPG, built on the popular online off-policy AC algorithms TD3 and DDPG. Evaluated on the MuJoCo benchmark, both variants match or improve upon the performance of their respective baselines in all environments tested, supporting the effectiveness of improving both the actor and the critic within the AC framework.
📝 Abstract
Many modern reinforcement learning algorithms build on the actor-critic (AC) framework: iterative improvement of a policy (the actor) using policy improvement operators and iterative approximation of the policy's value (the critic). In contrast, the popular value-based algorithm family employs improvement operators in the value update, to iteratively improve the value function directly. In this work, we propose a general extension to the AC framework that employs two separate improvement operators: one applied to the policy in the spirit of policy-based algorithms and one applied to the value in the spirit of value-based algorithms, which we dub Value-Improved AC (VI-AC). We design two practical VI-AC algorithms based on the popular online off-policy AC algorithms TD3 and DDPG. We evaluate VI-TD3 and VI-DDPG on the MuJoCo benchmark and find that both improve upon or match the performance of their respective baselines in all environments tested.
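The two-operator idea in the abstract can be illustrated with a toy tabular sketch: the critic's bootstrap target evaluates an *improved* version of the current policy, while the actor is updated by a separate, more conservative improvement operator. Everything below (the random MDP, the greedy/soft-greedy operators, the mixing rate) is an illustrative assumption, not the paper's exact algorithm or its deep-RL instantiations VI-TD3/VI-DDPG.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Random toy MDP (hypothetical): P[s, a, s'] transition probabilities, rewards in [0, 1]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # actor: stochastic policy
Q = np.zeros((n_states, n_actions))                   # critic: Q-value table

def improve(policy, Q, step):
    """One greedification step: move `policy` toward argmax_a Q(s, a)."""
    greedy = np.eye(n_actions)[Q.argmax(axis=1)]
    return (1.0 - step) * policy + step * greedy

for _ in range(300):
    Q_prev = Q.copy()
    # Value improvement: bootstrap from an improved policy, not from pi itself.
    target_pi = improve(pi, Q, step=1.0)      # fully greedy improvement in the target
    V_target = (target_pi * Q).sum(axis=1)    # V(s') under the improved policy
    Q = R + gamma * np.einsum('san,n->sa', P, V_target)
    # Policy improvement: a separate, conservative operator for the actor.
    pi = improve(pi, Q, step=0.1)

delta = np.max(np.abs(Q - Q_prev))  # change in the final critic update
```

With a fully greedy operator in the target, the critic update reduces to value-iteration-style backups and converges, while the actor is greedified only gradually, mirroring the conservative-vs-greedy tension the framework is designed around.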