🤖 AI Summary
To address the non-stationarity and training instability that a trainable adversary introduces in robust adversarial reinforcement learning, this paper proposes Uncertainty-Aware Critic Ensemble for robust adversarial Reinforcement learning (UACER). UACER (1) runs a diverse ensemble of K critic networks in parallel, in place of a single critic, to stabilize Q-value estimation, and (2) introduces a Time-varying Decay Uncertainty (TDU) mechanism, a variance-derived aggregation of the critics' Q-values that incorporates epistemic uncertainty to regulate the exploration-exploitation trade-off. Training is formulated as a zero-sum Markov game between a protagonist and an adversary. In experiments on several MuJoCo control problems, UACER outperforms state-of-the-art methods in overall performance, training stability, and efficiency.
📝 Abstract
Robust adversarial reinforcement learning has emerged as an effective paradigm for training agents to handle uncertain disturbances in real environments, with critical applications in sequential decision-making domains such as autonomous driving and robotic control. Within this paradigm, agent training is typically formulated as a zero-sum Markov game between a protagonist and an adversary to enhance policy robustness. However, because the adversary is itself trainable, the learning dynamics become non-stationary, which exacerbates training instability and convergence difficulties, particularly in high-dimensional complex environments. In this paper, we propose a novel approach, Uncertainty-Aware Critic Ensemble for robust adversarial Reinforcement learning (UACER), which consists of two strategies: 1) Diversified critic ensemble: instead of the conventional single-critic architecture, a diverse set of K critic networks is used in parallel to stabilize Q-value estimation, providing both variance reduction and robustness enhancement. 2) Time-varying Decay Uncertainty (TDU) mechanism: going beyond simple linear combinations, we develop a variance-derived Q-value aggregation strategy that explicitly incorporates epistemic uncertainty to dynamically regulate the exploration-exploitation trade-off while stabilizing the training process. Comprehensive experiments across several MuJoCo control problems validate the effectiveness of UACER, which outperforms state-of-the-art methods in terms of overall performance, stability, and efficiency.
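The abstract does not give the TDU formula, so the following is only a minimal sketch of what a variance-derived, time-varying aggregation over K critics could look like. It assumes the aggregate is the ensemble mean plus a coefficient times the ensemble standard deviation, and that the coefficient is annealed linearly over training. The names `decay_coefficient`, `aggregate_q`, `beta_start`, and `beta_end`, as well as the linear schedule and its sign convention, are hypothetical and are not taken from the paper.

```python
import numpy as np


def decay_coefficient(step, total_steps, beta_start=1.0, beta_end=-1.0):
    """Linearly anneal the uncertainty weight from beta_start to beta_end.

    Assumption (not from the paper): a positive weight adds an optimism
    bonus early in training, and a negative weight penalizes uncertain
    estimates later on.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)


def aggregate_q(q_values, beta):
    """Combine K critic estimates into one Q-value per sample.

    q_values: array-like of shape (K, batch), one row per critic.
    Returns the ensemble mean plus beta times the ensemble standard
    deviation, where the spread across critics serves as a proxy for
    epistemic uncertainty.
    """
    q = np.asarray(q_values, dtype=float)
    return q.mean(axis=0) + beta * q.std(axis=0)


# Two hypothetical critics evaluated on a batch of two state-action pairs.
q_ensemble = np.array([[1.0, 2.0],
                       [3.0, 2.0]])
beta_mid = decay_coefficient(step=50, total_steps=100)
q_target = aggregate_q(q_ensemble, beta_mid)
```

Under this sketch, samples on which the critics disagree are valued more optimistically early in training and more conservatively later, while samples on which they agree are left unchanged at every step.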