🤖 AI Summary
Actor-critic (AC) algorithms face a trade-off between greediness and stability in policy improvement: gradient-based policy updates tend to be conservative, while aggressively greedy updates can destabilize training. The proposed Value-Improved AC (VI-AC) framework addresses this by employing two separate improvement operators: one applied to the policy (the actor), in the spirit of policy-based methods, and one applied to the value function (the critic), in the spirit of value-based methods. The framework is instantiated in two practical algorithms, VI-TD3 and VI-DDPG, built on the popular online off-policy AC algorithms TD3 and DDPG. Evaluated on the MuJoCo benchmark, both variants match or improve upon the performance of their respective baselines in all environments tested, supporting the effectiveness of improving both the actor and the critic within the AC framework.
📝 Abstract
Many modern reinforcement learning algorithms build on the actor-critic (AC) framework: iterative improvement of a policy (the actor) using policy improvement operators and iterative approximation of the policy's value (the critic). In contrast, the popular value-based algorithm family employs improvement operators in the value update, to iteratively improve the value function directly. In this work, we propose a general extension to the AC framework that employs two separate improvement operators: one applied to the policy in the spirit of policy-based algorithms and one applied to the value in the spirit of value-based algorithms, which we dub Value-Improved AC (VI-AC). We design two practical VI-AC algorithms based on the popular online off-policy AC algorithms TD3 and DDPG. We evaluate VI-TD3 and VI-DDPG on the MuJoCo benchmark and find that both improve upon or match the performance of their respective baselines in all environments tested.
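The two-operator idea in the abstract can be illustrated with a toy tabular sketch: the critic's bootstrap target evaluates an *improved* version of the current policy, while the actor is updated by a separate, more conservative improvement operator. Everything below (the random MDP, the greedy/soft-greedy operators, the mixing rate) is an illustrative assumption, not the paper's exact algorithm or its deep-RL instantiations VI-TD3/VI-DDPG.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Random toy MDP (hypothetical): P[s, a, s'] transition probabilities, rewards in [0, 1]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # actor: stochastic policy
Q = np.zeros((n_states, n_actions))                   # critic: Q-value table

def improve(policy, Q, step):
    """One greedification step: move `policy` toward argmax_a Q(s, a)."""
    greedy = np.eye(n_actions)[Q.argmax(axis=1)]
    return (1.0 - step) * policy + step * greedy

for _ in range(300):
    Q_prev = Q.copy()
    # Value improvement: bootstrap from an improved policy, not from pi itself.
    target_pi = improve(pi, Q, step=1.0)      # fully greedy improvement in the target
    V_target = (target_pi * Q).sum(axis=1)    # V(s') under the improved policy
    Q = R + gamma * np.einsum('san,n->sa', P, V_target)
    # Policy improvement: a separate, conservative operator for the actor.
    pi = improve(pi, Q, step=0.1)

delta = np.max(np.abs(Q - Q_prev))  # change in the final critic update
```

With a fully greedy operator in the target, the critic update reduces to value-iteration-style backups and converges, while the actor is greedified only gradually, mirroring the conservative-vs-greedy tension the framework is designed around.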