GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

📅 2026-05-18

🤖 AI Summary
This work addresses the additional variance that stochastic action sampling introduces into Generalized Advantage Estimation (GAE) in self-play reinforcement learning for imperfect-information games, which undermines the stability and efficiency of policy optimization. The authors observe that this variance is amplified in equilibrium self-play, where the equilibrium policy is itself stochastic, and that it persists even when the critic is exact. To mitigate it, they propose Variance-Reduced Policy Optimization (VRPO), built on Q-boosting, an advantage estimator grounded in a centralized action-value critic. The estimator replaces sampled multi-step backups with a multi-step Expected SARSA(λ) trace, taking an expectation over the policy at each step to average out action-sampling noise while retaining Proximal Policy Optimization's (PPO) clipped objective and on-policy actor updates. Empirically, VRPO consistently achieves strong performance on mid-sized to large-scale imperfect-information games, including Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
📝 Abstract
Competitive multi-agent reinforcement learning in imperfect-information games requires agents to act under partial observability and against adversarial opponents, necessitating stochastic policies. While self-play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, generalized advantage estimation (GAE), suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self-play because of the stochastic nature of the equilibrium policy and persists even when the critic is exact. We address this bottleneck by introducing $Q$-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and propose Variance-Reduced Policy Optimization (VRPO), incorporating this new estimator. The algorithm replaces sampled multi-step backups with a multi-step Expected SARSA($\lambda$) trace, computing policy expectations at each step to average out action-sampling noise, while retaining PPO's clipped objective and on-policy actor updates. Empirically, VRPO consistently achieves strong performance across games ranging from mid-sized to large-scale, including Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
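The variance claim in the abstract can be illustrated with a small toy problem. The sketch below is not the authors' Q-boosting estimator or the VRPO algorithm; it is a minimal illustration under stated assumptions: a hypothetical five-step, two-action decision problem with deterministic transitions, a uniform stochastic policy, and exact critics computed by backward induction. It compares GAE($\lambda$), built from state-value TD residuals, with one common form of an Expected SARSA($\lambda$) trace, built from action-value TD residuals that bootstrap on the policy expectation of $Q$. All names and constants (`GAMMA`, `LAM`, `HORIZON`, `POLICY`, `REWARD`) are chosen for illustration only.

```python
import random
import statistics

GAMMA, LAM, HORIZON = 0.99, 0.95, 5   # illustrative constants (assumptions)
POLICY = (0.5, 0.5)                   # uniform stochastic policy over two actions
REWARD = (0.0, 1.0)                   # reward depends only on the chosen action

# Exact critics by backward induction. The state is just the time step and
# transitions are deterministic, so all randomness comes from action sampling.
V = [0.0] * (HORIZON + 1)             # V[HORIZON] = 0 is the terminal value
Q = [[0.0, 0.0] for _ in range(HORIZON)]


def expected_q(t):
    """Policy expectation of Q at step t (zero at the terminal step)."""
    if t >= HORIZON:
        return 0.0
    return sum(p * q for p, q in zip(POLICY, Q[t]))


for t in reversed(range(HORIZON)):
    Q[t] = [REWARD[a] + GAMMA * expected_q(t + 1) for a in (0, 1)]
    V[t] = expected_q(t)


def gae_advantage(actions):
    """GAE(lambda) advantage at t=0 from state-value TD residuals."""
    adv = 0.0
    for t in reversed(range(HORIZON)):
        delta = REWARD[actions[t]] + GAMMA * V[t + 1] - V[t]
        adv = delta + GAMMA * LAM * adv
    return adv


def expected_sarsa_advantage(actions):
    """Advantage at t=0 from an Expected SARSA(lambda)-style trace.

    Each TD residual bootstraps on the policy expectation of Q at the next
    step instead of on the value of the sampled next action.
    """
    trace = 0.0
    for t in reversed(range(HORIZON)):
        delta = REWARD[actions[t]] + GAMMA * expected_q(t + 1) - Q[t][actions[t]]
        trace = delta + GAMMA * LAM * trace
    return Q[0][actions[0]] - V[0] + trace


def sample_actions(rng, first_action=1):
    """Fix the first action and sample the remaining ones from POLICY."""
    rest = [rng.choices((0, 1), weights=POLICY)[0] for _ in range(HORIZON - 1)]
    return [first_action] + rest


rng = random.Random(0)
trajectories = [sample_actions(rng) for _ in range(2000)]
gae_samples = [gae_advantage(a) for a in trajectories]
es_samples = [expected_sarsa_advantage(a) for a in trajectories]
gae_var = statistics.pvariance(gae_samples)
es_var = statistics.pvariance(es_samples)
true_adv = Q[0][1] - V[0]

print("true advantage of action 1 at t=0:", true_adv)
print("GAE variance:", gae_var)
print("Expected SARSA(lambda) variance:", es_var)
```

In this toy setting both estimators are centered on the true advantage, but the GAE samples still vary with the sampled future actions even though the critic is exact, while the action-value residuals vanish identically. With stochastic transitions or an approximate critic the second estimator would not be noise-free; the sketch isolates only the action-sampling component of the variance.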
Problem

Research questions and friction points this paper is trying to address.

imperfect-information games
self-play reinforcement learning
stochastic policies
variance reduction
generalized advantage estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

variance reduction
imperfect-information games
self-play reinforcement learning
advantage estimation
Expected SARSA