🤖 AI Summary
To address the core challenges of high human-evaluation cost, data scarcity, and poor generalizability in Hanabi human-AI collaboration, this paper proposes a scalable evaluation framework grounded in human behavioral modeling. Methodologically, we construct high-fidelity “human proxy agents” trained on a large-scale dataset of real human games via integrated behavioral cloning, sequence modeling, and multi-agent reinforcement learning; we further design a constrained communication protocol and implement a controlled online evaluation system to ensure fairness. Key contributions include: (1) the first agent-driven evaluation paradigm for human-AI collaboration; (2) the open-source release of 3,079 high-quality games, the largest publicly available Hanabi human-gameplay dataset to date, deliberately capped to encourage data-efficient methods; and (3) empirical validation, in both two- and three-player settings, that our proxy policies closely replicate human behavior (p < 0.01), reduce evaluation cost by over 90%, and keep measurement error below 2% across repeated runs.
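To make the imitation-learning component of the summary concrete, here is a minimal, hypothetical behavioral-cloning sketch: a linear softmax policy fit by cross-entropy to logged (observation, action) pairs. The synthetic data, linear architecture, and hyperparameters are illustrative assumptions only; the paper's actual proxy agents combine behavioral cloning with sequence modeling and multi-agent reinforcement learning.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBS, N_ACTIONS, N_SAMPLES = 8, 4, 2000

# Synthetic stand-in for logged human games: the "human" picks the
# argmax action of a fixed ground-truth linear scoring function.
W_true = rng.normal(size=(N_OBS, N_ACTIONS))
obs = rng.normal(size=(N_SAMPLES, N_OBS))
actions = (obs @ W_true).argmax(axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Behavioral cloning = supervised learning: minimize cross-entropy
# between the policy's action distribution and the human's actions.
W = np.zeros((N_OBS, N_ACTIONS))
lr = 0.5
for _ in range(300):
    probs = softmax(obs @ W)                    # (N, A) policy distribution
    onehot = np.eye(N_ACTIONS)[actions]         # (N, A) human action labels
    grad = obs.T @ (probs - onehot) / N_SAMPLES # cross-entropy gradient
    W -= lr * grad

accuracy = ((obs @ W).argmax(axis=1) == actions).mean()
print(f"imitation accuracy: {accuracy:.2f}")
```

Because the cloned policy and the data-generating "human" share the same linear model class here, the imitation accuracy climbs well above chance (0.25 for four actions); on real Hanabi trajectories the policy class, observation encoding, and training data are of course far richer.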
📝 Abstract
Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is a cooperative card game featuring imperfect information, constrained communication, theory-of-mind requirements, and coordinated action -- making it an ideal testbed for human-AI coordination. However, its use for human-AI interaction research has been limited by the challenges of human evaluation. In this work, we introduce the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to overcome the constraints of costly and difficult-to-reproduce human evaluations. We develop *human proxy agents* trained on a large-scale human dataset; these serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. We present baseline results for both two- and three-player Hanabi scenarios. To ensure fair evaluation, we host the proxy agents through a controlled evaluation system rather than releasing them publicly. The code is available at https://github.com/FLAIROx/ah2ac2.