The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI benchmarks rarely stress partial observability, long-horizon planning, and multi-agent decision-making simultaneously. This work proposes the first scalable, reproducible large-scale decision-making benchmark built on Pokémon battles and RPG speedrunning, targeting strategic generalization and long-horizon sequential decision-making respectively. It provides over 20 million battle trajectories, diverse baseline methods, and a standardized evaluation framework. By integrating heuristic search, reinforcement learning, large language models (LLMs), and multi-agent architectures, the authors develop an open-source, modular evaluation system that reveals significant performance gaps between LLM agents, RL agents, and human experts, highlighting capability dimensions orthogonal to existing LLM evaluations. The benchmark underpinned the NeurIPS 2025 competition, attracting over 100 participating teams, and continues as a living benchmark with an online leaderboard to advance research in complex decision-making.

📝 Abstract
We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.
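The battling track's core setting, an agent repeatedly choosing an action from a partial observation of the game state, can be sketched generically. Everything below (the `Observation` fields, `choose_action`, the toy damage numbers) is a hypothetical illustration of the interaction loop, not the challenge's actual interface or ruleset:

```python
import random
from dataclasses import dataclass

# Minimal sketch of a partially observable, turn-based decision loop.
# Field names and mechanics are invented for illustration only.

@dataclass
class Observation:
    my_hp: int               # the agent sees its own side fully...
    opp_hp_fraction: float   # ...but only coarse info about the opponent
    legal_moves: tuple

def choose_action(obs: Observation, rng: random.Random) -> str:
    # Trivial heuristic policy: press the attack on a weakened opponent,
    # otherwise pick a random legal move.
    if obs.opp_hp_fraction < 0.25 and "attack" in obs.legal_moves:
        return "attack"
    return rng.choice(obs.legal_moves)

def run_episode(seed: int = 0, max_turns: int = 50) -> int:
    rng = random.Random(seed)
    my_hp, opp_hp = 100, 100
    for turn in range(max_turns):
        obs = Observation(my_hp, opp_hp / 100, ("attack", "defend"))
        if choose_action(obs, rng) == "attack":
            opp_hp -= rng.randint(10, 25)
        else:
            my_hp = min(100, my_hp + 5)
        if opp_hp <= 0:
            return turn + 1   # turns taken to win
        my_hp -= rng.randint(5, 15)
        if my_hp <= 0:
            return -1         # loss
    return 0                  # draw / turn limit reached

result = run_episode(seed=42)
```

The speedrunning track differs mainly in horizon: the same observe-act loop runs for thousands of steps toward a single distant goal rather than a per-battle outcome.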
Problem

Research questions and friction points this paper is trying to address.

partial observability
game-theoretic reasoning
long-horizon planning
multi-agent decision-making
competitive AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon planning
partial observability
multi-agent decision-making
benchmark orthogonality
modular LLM evaluation