Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges of ad-hoc teamwork, partial observability and the difficulty of inferring unknown teammates' strategies, by introducing ICRL4AHT, the first large-scale, reproducible benchmark for in-context reinforcement learning (ICRL) in this setting. Built on an efficient JAX-based implementation of Overcooked-V2, the benchmark provides a diverse library of teammate policies and supports end-to-end training as well as evaluation under distributional shift. A systematic evaluation of history-conditioned ICRL approaches, including Algorithm Distillation and Decision-Pretrained Transformer, shows that current methods often underperform even random policies when they encounter unseen teammates or unseen kitchen layouts. This points to a critical bottleneck in strategic reasoning for multi-agent coordination and to the need for new ICRL algorithms tailored to ad-hoc teamwork.
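To make "evaluation under distributional shift" concrete, the following is a minimal sketch of how a teammate library and a set of layouts could be split into disjoint training and held-out sets. All identifiers (`split_holdout`, the teammate and layout names, the 25% hold-out fraction) are hypothetical, and pairing held-out teammates with training layouts (and vice versa) is an assumption for illustration, not the benchmark's documented protocol.

```python
import random


def split_holdout(items, test_fraction=0.25, seed=0):
    """Shuffle deterministically and return disjoint (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical identifiers: the real teammate and layout names are not given here.
teammates = [f"rl_{i}" for i in range(6)] + [f"heuristic_{i}" for i in range(2)]
layouts = [f"layout_{i}" for i in range(4)]

train_mates, test_mates = split_holdout(teammates, seed=0)
train_layouts, test_layouts = split_holdout(layouts, seed=1)

# Training data only ever pairs training teammates with training layouts.
train_pairs = [(m, l) for m in train_mates for l in train_layouts]

# Assumed track definitions: each track shifts exactly one factor.
unseen_teammate_track = [(m, l) for m in test_mates for l in train_layouts]
unseen_layout_track = [(m, l) for m in train_mates for l in test_layouts]
```

Shifting one factor per track keeps the two sources of failure separable: a drop on the unseen-teammate track cannot be blamed on an unfamiliar kitchen, and vice versa.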

📝 Abstract

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To rigorously evaluate this, we introduce a large-scale benchmark, ICRL4AHT, built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines on both the unseen-teammate and unseen-layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the Overcooked-V2 AHT protocol, establishing our benchmark as a critical testbed for next-generation coordination algorithms.
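Two steps of the pipeline above, dataset construction and the check for in-context improvement, can be sketched in a few lines. The sketch assumes an Algorithm Distillation style dataset: transitions from a source agent's learning history are concatenated across episode boundaries, and the model is trained to predict the next action from the preceding context. The function names, the `(observation, action, reward)` tuple layout, and the use of a least-squares slope as the improvement measure are illustrative assumptions, not the benchmark's actual code or metric.

```python
def build_ad_examples(learning_history, context_len):
    """Turn a learning history into (context, observation, target_action) examples.

    learning_history: list of episodes in training order; each episode is a
    list of (observation, action, reward) tuples.
    """
    # Flatten in the order the source agent experienced the transitions,
    # so a context window can span several episodes.
    transitions = [t for episode in learning_history for t in episode]
    examples = []
    for end in range(1, len(transitions)):
        start = max(0, end - context_len)
        context = transitions[start:end]
        observation, action, _reward = transitions[end]
        examples.append((context, observation, action))
    return examples


def in_context_slope(episode_returns):
    """Least-squares slope of return against evaluation-episode index."""
    n = len(episode_returns)
    if n < 2:
        raise ValueError("need at least two episodes to measure a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(episode_returns) / n
    cov = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(episode_returns))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Toy learning history with two short episodes (made-up values).
history = [
    [("o0", 0, 0.0), ("o1", 1, 0.0)],
    [("o2", 1, 1.0), ("o3", 0, 0.0)],
]
examples = build_ad_examples(history, context_len=3)
flat_trend = in_context_slope([2.0, 2.0, 2.0])
rising_trend = in_context_slope([0.0, 1.0, 2.0, 3.0])
```

Under this measure, an agent that adapts within its context would show a positive slope over evaluation episodes with a fixed teammate; a slope near zero corresponds to the "no clear in-context improvement" outcome reported in the abstract.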

Problem

Research questions and friction points this paper is trying to address.

In-Context Reinforcement Learning

Ad-Hoc Teamwork

Multi-Agent Coordination

Test-Time Adaptation

Partial Observability

Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Reinforcement Learning

Ad-Hoc Teamwork

Benchmarking