Ad-Hoc Human-AI Coordination Challenge

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the core challenges of costly human evaluation, data scarcity, and poor generalizability in Hanabi human-AI collaboration, this paper proposes a scalable evaluation framework grounded in human behavioral modeling. Methodologically, the authors construct high-fidelity human proxy agents trained on 3,079 high-quality real-game episodes via integrated behavioral cloning, sequence modeling, and multi-agent reinforcement learning; they further design a constrained communication protocol and implement a controlled online evaluation system to ensure fair comparison. Key contributions include: (1) the first agent-driven evaluation paradigm for human-AI collaboration; (2) the release of the largest publicly available Hanabi human-AI collaboration dataset to date; and (3) empirical validation, in both two- and three-player settings, that the proxy policies closely replicate human behavior (p < 0.01), reduce evaluation cost by over 90%, and achieve reproducibility with sub-2% measurement error.
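The summary mentions behavioral cloning on logged human games as one component of training the proxy agents. As a rough illustration of that component only (all sizes, the linear policy, and the random batch below are hypothetical stand-ins; the paper's proxies additionally use sequence modeling and multi-agent RL), a single cross-entropy gradient step toward logged human actions looks like:

```python
import numpy as np

# Minimal behavioral-cloning sketch for a Hanabi-style human proxy.
# All dimensions and data here are illustrative, not the paper's setup.
rng = np.random.default_rng(0)
OBS_DIM, NUM_ACTIONS = 64, 21  # toy observation/action sizes

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def bc_step(W, obs, actions, lr=0.05):
    """One cross-entropy gradient step toward the logged human actions."""
    probs = softmax(obs @ W)                       # (B, A) policy over actions
    onehot = np.eye(NUM_ACTIONS)[actions]          # (B, A) human choices
    grad = obs.T @ (probs - onehot) / len(actions) # CE gradient w.r.t. W
    return W - lr * grad

# Fake batch of human state-action pairs (stand-in for the 3,079 games).
W = np.zeros((OBS_DIM, NUM_ACTIONS))               # linear policy: logits = obs @ W
obs = rng.normal(size=(32, OBS_DIM))
acts = rng.integers(0, NUM_ACTIONS, size=32)
before = -np.log(softmax(obs @ W)[np.arange(32), acts]).mean()
W = bc_step(W, obs, acts)
after = -np.log(softmax(obs @ W)[np.arange(32), acts]).mean()
# after < before: the policy moves toward imitating the logged actions
```

The point of the sketch is the loss target: the "human-likeness" of the proxy comes from minimizing cross-entropy against real human moves, not from maximizing game score.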

📝 Abstract
Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is a cooperative card game featuring imperfect information, constrained communication, theory of mind requirements, and coordinated action -- making it an ideal testbed for human-AI coordination. However, its use for human-AI interaction has been limited by the challenges of human evaluation. In this work, we introduce the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to overcome the constraints of costly and difficult-to-reproduce human evaluations. We develop human proxy agents trained on a large-scale human dataset that serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. We present baseline results for both two- and three-player Hanabi scenarios. To ensure fair evaluation, we host the proxy agents through a controlled evaluation system rather than releasing them publicly. The code is available at https://github.com/FLAIROx/ah2ac2.
Problem

Research questions and friction points this paper is trying to address.

Overcoming costly human evaluation in AI coordination
Developing human proxy agents for reproducible testing
Addressing limited human gameplay data for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human proxy agents for evaluation
Limited human gameplay dataset
Controlled evaluation system
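The "controlled evaluation system" innovation means the proxy agents run server-side: entrants submit actions, while the host owns the proxies and the environment, so proxy weights never leave the server and scores stay reproducible. A toy in-process sketch of that turn-taking arrangement (the class names, observation dict, and scoring below are invented for illustration and are not AH2AC2's actual API):

```python
import random

class HostedProxy:
    """Stand-in for a server-side human proxy; its policy stays private."""
    def act(self, observation):
        return random.choice(observation["legal_actions"])

def run_episode(candidate_act, proxies, num_turns=10):
    """Alternate turns between the candidate agent and hosted proxies."""
    players = [candidate_act] + [p.act for p in proxies]
    score = 0
    for turn in range(num_turns):
        obs = {"turn": turn, "legal_actions": [0, 1, 2]}  # toy observation
        action = players[turn % len(players)](obs)
        score += action  # toy scoring; real Hanabi scores completed fireworks
    return score

random.seed(0)
# Three-player setting: one candidate agent plus two hosted proxies.
result = run_episode(lambda obs: obs["legal_actions"][0],
                     [HostedProxy(), HostedProxy()])
```

The design choice this mirrors is that the candidate only ever sees observations and returns actions; everything else (proxy policies, environment state, scoring) is controlled by the evaluator.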
Tin Dizdarević
FLAIR, University of Oxford, Oxford, UK
Ravi Hammond
FLAIR, University of Oxford, Oxford, UK
Tobias Gessler
FLAIR, University of Oxford, Oxford, UK
Anisoara Calinescu
Department of Computer Science & Reuben College, University of Oxford
Complex Systems, Emergence, Entropy, Agent-Based Modelling, Supply Chain and Manufacturing Systems
Jonathan Cook
FLAIR, University of Oxford, Oxford, UK
Matteo Gallici
Universitat Politècnica de Catalunya
Artificial Intelligence, Reinforcement Learning, Multi-Agent Reinforcement Learning
Andrei Lupu
University of Oxford & FAIR, Meta AI
Reinforcement Learning, Multi-Agent RL
Jakob Nicolaus Foerster
FLAIR, University of Oxford, Oxford, UK