GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the extra variance that stochastic action sampling introduces into Generalized Advantage Estimation (GAE) in self-play reinforcement learning for imperfect-information games, which undermines the stability and efficiency of policy optimization. The authors first show that this variance is inherent to GAE in equilibrium-based self-play, where it persists even when the critic is exact. They then propose Variance-Reduced Policy Optimization (VRPO), built on Q-boosting, an advantage estimator grounded in a centralized action-value critic. The estimator replaces sampled multi-step backups with a multi-step Expected SARSA(λ) trace that takes an expectation over the policy at each step to average out action-sampling noise, while retaining Proximal Policy Optimization's (PPO) clipped objective and on-policy actor updates. Empirically, VRPO achieves consistently strong performance across mid-sized to large-scale imperfect-information games, including Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
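
To make the variance argument concrete, the standard GAE definition can be set beside one plausible form of an Expected SARSA$(\lambda)$ advantage. The first two lines are the usual GAE equations. The last three are an assumed sketch, not the paper's exact estimator: the symbols $Q$ (action-value critic), $\bar{V}(s) = \sum_a \pi(a \mid s)\, Q(s, a)$, and $\delta^{Q}_t$ are introduced here for illustration only.

$$
\begin{aligned}
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t), \\
\hat{A}^{\mathrm{GAE}}_t &= \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}, \\
\bar{V}(s) &= \sum_a \pi(a \mid s)\, Q(s, a), \\
\delta^{Q}_t &= r_t + \gamma \bar{V}(s_{t+1}) - Q(s_t, a_t), \\
\hat{A}^{Q}_t &= Q(s_t, a_t) - \bar{V}(s_t) + \sum_{l \ge 0} (\gamma\lambda)^l \, \delta^{Q}_{t+l}.
\end{aligned}
$$

With an exact critic and deterministic transitions, $\delta_{t+l}$ equals the true advantage of the sampled action $a_{t+l}$, which is nonzero whenever the policy is stochastic, so every future action adds noise to $\hat{A}^{\mathrm{GAE}}_t$. Under the same conditions $\delta^{Q}_{t+l} = 0$, because the sampled action is already accounted for by $Q(s_{t+l}, a_{t+l})$ and the next step is averaged over the policy.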

📝 Abstract

Competitive multi-agent reinforcement learning in imperfect-information games requires agents to act under partial observability and against adversarial opponents, necessitating stochastic policies. While self-play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, generalized advantage estimation (GAE), suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self-play because of the stochastic nature of the equilibrium policy, and it persists even when the critic is exact. We address this bottleneck by introducing $Q$-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and propose Variance-Reduced Policy Optimization (VRPO), which incorporates this new estimator. The algorithm replaces sampled multi-step backups with a multi-step Expected SARSA$(\lambda)$ trace, computing policy expectations at each step to average out action-sampling noise, while retaining PPO's clipped objective and on-policy actor updates. Empirically, VRPO consistently achieves strong performance across games ranging from mid-sized to large-scale, including Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
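
The claim that GAE's extra variance persists even with an exact critic can be checked on a toy problem. The sketch below is illustrative only and is not the authors' implementation: the function names, the two-step game, and the particular Expected SARSA(λ) form (a policy-averaged bootstrap at every step) are assumptions made for this example. In the toy game the first action has no effect, the second action `a1` pays reward `a1`, and the policy is uniform, so the true advantage of the first action is exactly 0.

```python
import numpy as np


def gae_advantages(rewards, values, gamma, lam):
    """Standard GAE. `values` has length T+1; values[T] is the bootstrap (0 at terminal)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # TD error built from state values only; it carries the noise of the sampled action.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def expected_sarsa_lambda_advantages(rewards, q_taken, v_expected, gamma, lam):
    """Assumed Expected SARSA(lambda)-style advantage (hypothetical form).

    q_taken[t]    = Q(s_t, a_t) for the sampled action.
    v_expected[t] = sum_a pi(a | s_t) * Q(s_t, a); length T+1, 0 at terminal.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # The bootstrap averages Q over the policy instead of following the sampled action.
        delta = rewards[t] + gamma * v_expected[t + 1] - q_taken[t]
        running = delta + gamma * lam * running
        # lambda-return minus the policy-averaged baseline at s_t.
        adv[t] = q_taken[t] + running - v_expected[t]
    return adv


def toy_first_step_advantages(a1):
    """Two-step toy game with exact critics, gamma = lam = 1, uniform policy."""
    rewards = np.array([0.0, float(a1)])      # only the second action is rewarded
    v = np.array([0.5, 0.5, 0.0])             # exact V(s0), V(s1), terminal
    q_taken = np.array([0.5, float(a1)])      # exact Q(s0, a0), Q(s1, a1)
    gae = gae_advantages(rewards, v, 1.0, 1.0)[0]
    esl = expected_sarsa_lambda_advantages(rewards, q_taken, v, 1.0, 1.0)[0]
    return gae, esl


if __name__ == "__main__":
    for a1 in (0, 1):
        gae, esl = toy_first_step_advantages(a1)
        print(f"a1={a1}: GAE={gae:+.2f}, Expected SARSA(lambda)={esl:+.2f}")
```

In this toy case the GAE estimate for the first action swings between -0.5 and +0.5 depending on which second action happened to be sampled, while the expectation-based estimate is 0 in both cases. Real games add transition and opponent randomness plus critic error, so the reduction would be partial rather than exact.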

Problem

Research questions and friction points this paper is trying to address.

imperfect-information games

self-play reinforcement learning

stochastic policies

variance reduction

generalized advantage estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

variance reduction

imperfect-information games

self-play reinforcement learning