Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

📅 2026-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of ad-hoc teamwork—specifically partial observability and the complexity of inferring unknown teammates’ strategies—by introducing ICRL4AHT, the first large-scale, reproducible benchmark for in-context reinforcement learning (ICRL) in this setting. Built upon an efficient JAX-based implementation of Overcooked-V2, the benchmark features a diverse library of teammate policies and supports end-to-end training as well as evaluation under distributional shift. Systematic evaluation of history-conditioned ICRL approaches, including Algorithm Distillation and Decision-Pretrained Transformers, reveals that current methods often underperform even random policies when encountering unseen teammates or kitchen layouts. This highlights a critical bottleneck in strategic reasoning for multi-agent coordination and underscores the urgent need for novel ICRL algorithms tailored to ad-hoc teamwork scenarios.
📝 Abstract
In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To rigorously evaluate this, we introduce ICRL4AHT, a large-scale benchmark built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines on both the unseen-teammate and unseen-layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the Overcooked-V2 AHT protocol, establishing our benchmark as a critical testbed for next-generation coordination algorithms.
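To make "online multi-episode evaluation" concrete, the following is a minimal sketch under stated assumptions, not the benchmark's actual code. It replaces Overcooked-V2 with a hypothetical one-step matching game, and every name in it (`make_teammate`, `history_policy`, `random_policy`, `evaluate`, `in_context_gain`) is invented for illustration. It shows only the general shape of the protocol: the teammate stays fixed, the context is carried across episodes rather than reset, and per-episode returns are compared against a random baseline.

```python
import random

N_ACTIONS = 4  # hypothetical toy action space, not Overcooked-V2's


def make_teammate(preferred):
    """Hypothetical fixed teammate that always plays one action."""
    return lambda: preferred


def random_policy(history, rng):
    """Random baseline: ignores the context entirely."""
    return rng.randrange(N_ACTIONS)


def history_policy(history, rng):
    """Toy history-conditioned policy (an assumption, not AD or DPT):
    copy the teammate's most frequent action seen so far."""
    if not history:
        return rng.randrange(N_ACTIONS)
    counts = [0] * N_ACTIONS
    for _, mate_action, _ in history:
        counts[mate_action] += 1
    return counts.index(max(counts))


def evaluate(policy, teammate, n_episodes=5, steps=10, seed=0):
    """Run several episodes with one fixed teammate and return per-episode returns."""
    rng = random.Random(seed)
    history = []  # cross-episode context; deliberately never reset
    returns = []
    for _ in range(n_episodes):
        ep_return = 0
        for _ in range(steps):
            a = policy(history, rng)
            b = teammate()
            r = 1 if a == b else 0  # reward only when the two agents coordinate
            history.append((a, b, r))
            ep_return += r
        returns.append(ep_return)
    return returns


def in_context_gain(returns):
    """Return of the last episode minus return of the first episode."""
    return returns[-1] - returns[0]


if __name__ == "__main__":
    mate = make_teammate(2)
    print("history-conditioned:", evaluate(history_policy, mate))
    print("random baseline:    ", evaluate(random_policy, mate))
```

In this toy setting the history-conditioned policy locks onto the teammate after a single step, which is exactly the kind of in-context improvement the abstract reports as absent for the evaluated baselines on the real benchmark.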
Problem

Research questions and friction points this paper is trying to address.

In-Context Reinforcement Learning
Ad-Hoc Teamwork
Multi-Agent Coordination
Test-Time Adaptation
Partial Observability
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Reinforcement Learning
Ad-Hoc Teamwork
Benchmarking
Multi-Agent Coordination
Algorithm Distillation
Yuheng Jing
C2DL, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Kai Li
University of Chinese Academy of Sciences & City University of Hong Kong
Computer Vision, Multimodal Language Model, Remote Sensing
Ziwen Zhang
C2DL, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Large Language Models, Multimodal Information Processing
Zeyao Ma
Renmin University of China
Large Language Model, Code Generation, Reasoning, Table Processing
Jiaxi Yang
PhD student, SIAT, CAS, China
Natural Language Processing, Large Language Model
Lei Zhang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Agentic Coding, Reinforcement Learning, Large Language Model
Zhe Wu
Tsinghua University
Reinforcement Learning
Jinmin He
C2DL, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Junliang Xing
Department of Computer Science and Technology, Tsinghua University
Jian Cheng
C2DL, Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences; AiRiA.