Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the absence of a unified benchmark for fairly comparing diverse paradigms, including reinforcement learning (RL), large language models (LLMs), and vision-language models (VLMs), on sequential decision-making tasks. The authors propose Agentick, a multimodal, extensible evaluation platform exposed through a Gymnasium-compatible interface, featuring 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities. The framework allows RL agents, LLMs, VLMs, hybrid architectures, and humans to be assessed in a common setting. The platform ships with oracle reference policies, supervised fine-tuning (SFT) datasets, and a composable agent harness. Across 27 configurations and more than 90,000 episodes, no single approach dominates: GPT-5 mini achieves the highest overall oracle-normalized score (0.309), PPO leads on planning and multi-agent tasks, the reasoning harness improves LLM performance by 3–10×, and ASCII observations consistently outperform natural-language descriptions.

📝 Abstract

AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
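To make the evaluation setup in the abstract concrete, the following is a minimal sketch of an episode loop over a Gymnasium-style interface together with an oracle-normalized score. It is not Agentick's code: `ToyCorridor`, `run_episode`, `oracle_policy`, `make_random_policy`, and `oracle_normalized_score` are hypothetical names, the toy task stands in for a real benchmark task, and the normalization used here (mean agent return divided by mean oracle return) is an assumption, since the abstract does not state the exact formula. Only the `reset`/`step` return conventions follow the Gymnasium API.

```python
import random
from statistics import mean


class ToyCorridor:
    """Hypothetical stand-in task that mimics Gymnasium's reset/step conventions."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps

    def reset(self, seed=None):
        # The toy task is deterministic, so the seed is accepted but unused.
        self.pos = 0
        self.steps = 0
        return self.pos, {}  # (observation, info)

    def step(self, action):
        # Action 1 moves right; any other action moves left (floored at 0).
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        self.steps += 1
        terminated = self.pos >= self.length
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}


def run_episode(env, policy, seed):
    """Run one episode and return the undiscounted return."""
    obs, info = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


def oracle_policy(obs):
    """Reference policy for the toy task: always move right."""
    return 1


def make_random_policy(seed):
    """Baseline policy that picks left or right uniformly at random."""
    rng = random.Random(seed)
    return lambda obs: rng.choice([0, 1])


def oracle_normalized_score(env, policy, oracle, seeds):
    """ASSUMED normalization: mean agent return / mean oracle return."""
    agent_return = mean(run_episode(env, policy, s) for s in seeds)
    oracle_return = mean(run_episode(env, oracle, s) for s in seeds)
    return agent_return / oracle_return if oracle_return > 0 else 0.0
```

Under this assumed definition the oracle scores 1.0 against itself, and a weaker policy such as `make_random_policy(0)` falls somewhere between 0 and 1, which is how a headline figure like 0.309 can be read as a fraction of oracle performance.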

Problem

Research questions and friction points this paper is trying to address.

sequential decision-making

AI agents

unified benchmark

general autonomous agents

agent evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

sequential decision-making

unified benchmark

foundation model agents