🤖 AI Summary
This work investigates zero-shot cross-seed coordination in the Hanabi environment, a canonical Dec-POMDP. We find that simply increasing the entropy regularization coefficient of independent PPO to 0.05, combined with a high GAE parameter (λ ≈ 0.9) and an RNN policy architecture, significantly improves the natural compatibility of policies trained with different random seeds, enabling high-scoring cross-seed collaboration without any dedicated communication or coordination mechanism. This minimal modification achieves new SOTA performance within the standard independent learning framework, substantially outperforming prior methods explicitly designed for coordination. Our key contribution is twofold: (i) we demonstrate that moderate entropy regularization implicitly aligns policy representations in strategy space, endowing independently trained policies with generalizable cooperative capability; and (ii) we highlight inherent limitations of standard policy gradient methods in strongly partially observable, highly interdependent cooperative settings.
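The recipe above amounts to a small hyperparameter diff on standard independent PPO. A minimal sketch of that diff follows; the key and baseline values are illustrative (hypothetical config names, not from any particular codebase), with only the three changed values taken from the summary:

```python
# Baseline independent-PPO settings (illustrative defaults, not from the paper)
baseline_config = {
    "ent_coef": 0.01,            # typical entropy coefficient
    "gae_lambda": 0.95,          # common GAE parameter
    "policy_arch": "feed_forward",
}

# The three changes the summary reports as improving cross-seed play in Hanabi
cross_play_config = {
    **baseline_config,
    "ent_coef": 0.05,        # moderately higher entropy regularization
    "gae_lambda": 0.9,       # high GAE lambda
    "policy_arch": "rnn",    # recurrent actor-critic for partial observability
}
```

Everything else about the training loop stays standard independent PPO, which is what makes the result notable.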
📝 Abstract
We find that in Hanabi, one of the most complex and popular benchmarks for zero-shot coordination and ad-hoc teamplay, a standard implementation of independent PPO with a slightly higher entropy coefficient (0.05 instead of the typically used 0.01) achieves a new state-of-the-art in cross-play between different seeds, beating all previous specialized algorithms designed for this setting by a significant margin. We provide an intuition for why sufficiently high entropy regularization ensures that different random seeds produce joint policies that are mutually compatible. We also empirically find that a high $\lambda_{\text{GAE}}$ of around 0.9, and using RNNs instead of purely feed-forward layers in the actor-critic architecture, strongly increase inter-seed cross-play. While these results demonstrate the dramatic effect that hyperparameters can have not just on self-play scores but also on cross-play scores, we show that there are simple Dec-POMDPs in which standard policy gradient methods with increased entropy regularization cannot achieve perfect inter-seed cross-play, demonstrating the continuing necessity for new zero-shot coordination algorithms.
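The entropy regularization at the heart of the result is the standard entropy bonus added to the PPO clipped surrogate objective. A minimal NumPy sketch of that objective is below (function names are mine, for illustration; the 0.05 entropy coefficient and 0.2 clip range are the only values the abstract fixes, the latter being PPO's common default):

```python
import numpy as np

def entropy(probs):
    # Shannon entropy of categorical action distributions (last axis = actions)
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def ppo_objective(ratio, advantage, probs, ent_coef=0.05, clip_eps=0.2):
    # Clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A), plus entropy bonus.
    # Raising ent_coef (0.01 -> 0.05) is the change the abstract highlights.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    l_clip = np.minimum(ratio * advantage, clipped * advantage)
    return np.mean(l_clip) + ent_coef * np.mean(entropy(probs))
```

The bonus term pushes each seed's policy toward higher-entropy action distributions, which is the mechanism the paper credits for keeping independently trained policies mutually compatible.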