🤖 AI Summary
This work investigates zero-shot cross-seed coordination in the Hanabi environment, a canonical Dec-POMDP. We find that simply increasing the entropy regularization coefficient of independent PPO to 0.05, combined with a high GAE parameter (λ ≈ 0.9) and an RNN policy architecture, significantly improves the natural compatibility of policies trained with different random seeds, enabling high-scoring cross-seed collaboration without any dedicated communication or coordination mechanism. This minimal modification achieves new SOTA performance within the standard independent learning framework, substantially outperforming prior methods explicitly designed for coordination. Our key contribution is twofold: (i) we demonstrate that moderate entropy regularization implicitly aligns policy representations in strategy space, endowing independently trained policies with generalizable cooperative capability; and (ii) we highlight inherent limitations of standard policy gradient methods in strongly partially observable, highly interdependent cooperative settings.
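The recipe above amounts to a small hyperparameter diff on standard independent PPO. A minimal sketch of that diff follows; the key and baseline values are illustrative (hypothetical config names, not from any particular codebase), with only the three changed values taken from the summary:

```python
# Baseline independent-PPO settings (illustrative defaults, not from the paper)
baseline_config = {
    "ent_coef": 0.01,            # typical entropy coefficient
    "gae_lambda": 0.95,          # common GAE parameter
    "policy_arch": "feed_forward",
}

# The three changes the summary reports as improving cross-seed play in Hanabi
cross_play_config = {
    **baseline_config,
    "ent_coef": 0.05,        # moderately higher entropy regularization
    "gae_lambda": 0.9,       # high GAE lambda
    "policy_arch": "rnn",    # recurrent actor-critic for partial observability
}
```

Everything else about the training loop stays standard independent PPO, which is what makes the result notable.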
📝 Abstract
We find that in Hanabi, one of the most complex and popular benchmarks for zero-shot coordination and ad-hoc teamplay, a standard implementation of independent PPO with a slightly higher entropy coefficient (0.05 instead of the typically used 0.01) achieves a new state-of-the-art in cross-play between different seeds, beating all previous specialized algorithms designed for this setting by a significant margin. We provide an intuition for why sufficiently high entropy regularization ensures that different random seeds produce joint policies that are mutually compatible. We also empirically find that a high $\lambda_{\text{GAE}}$ of around 0.9, and using RNNs instead of purely feed-forward layers in the actor-critic architecture, strongly increase inter-seed cross-play. While these results demonstrate the dramatic effect that hyperparameters can have not just on self-play scores but also on cross-play scores, we show that there are simple Dec-POMDPs in which standard policy gradient methods with increased entropy regularization cannot achieve perfect inter-seed cross-play, demonstrating the continuing necessity for new zero-shot coordination algorithms.
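The entropy regularization at the heart of the result is the standard entropy bonus added to the PPO clipped surrogate objective. A minimal NumPy sketch of that objective is below (function names are mine, for illustration; the 0.05 entropy coefficient and 0.2 clip range are the only values the abstract fixes, the latter being PPO's common default):

```python
import numpy as np

def entropy(probs):
    # Shannon entropy of categorical action distributions (last axis = actions)
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def ppo_objective(ratio, advantage, probs, ent_coef=0.05, clip_eps=0.2):
    # Clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A), plus entropy bonus.
    # Raising ent_coef (0.01 -> 0.05) is the change the abstract highlights.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    l_clip = np.minimum(ratio * advantage, clipped * advantage)
    return np.mean(l_clip) + ent_coef * np.mean(entropy(probs))
```

The bonus term pushes each seed's policy toward higher-entropy action distributions, which is the mechanism the paper credits for keeping independently trained policies mutually compatible.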