PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for large language model (LLM) agents struggle to comprehensively evaluate their capabilities in sustained decision-making and self-improvement within complex strategic interactions. This work proposes PTCG-Bench, the first evaluation framework based on the Pokémon Trading Card Game, which systematically assesses LLM agents along two dimensions: in-game strategic decision-making and experience-driven self-evolution. We introduce an interpretable, modular harness that decouples agent architecture from underlying model capabilities and use ablation analyses to identify key performance factors. Experimental results demonstrate that LLM agents can achieve non-trivial performance in this environment; however, stable and continuous self-evolution remains challenging, and performance is sensitive to harness design.

📝 Abstract

Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pokémon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision-making performance within a single complex environment, and (2) their ability to self-evolve through accumulated experience. We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG-Bench will facilitate future research on harness-aware and self-evolving agents in realistic interactive environments.

Problem

Research questions and friction points this paper is trying to address.

LLM agents

strategic decision-making

self-evolution

interactive environments

agent benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

PTCG-Bench

LLM agents

self-evolving