Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the absence of a unified benchmark for fairly comparing diverse paradigms, including reinforcement learning (RL), large language models (LLMs), and vision-language models (VLMs), on sequential decision-making tasks. The authors propose Agentick, a multimodal, extensible evaluation platform exposed through a single Gymnasium-compatible interface, featuring 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities. The framework allows RL, LLM, VLM, hybrid, and human agents to be assessed within a common setting. The platform ships with oracle reference policies for all tasks, pre-built supervised fine-tuning (SFT) datasets, and a composable agent harness. Across 27 configurations and over 90,000 episodes, no single approach dominates: GPT-5 mini achieves the highest overall oracle-normalized score (0.309), PPO leads on planning and multi-agent tasks, the reasoning harness multiplies LLM performance by 3–10×, and ASCII observations consistently outperform natural-language descriptions.
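To make the Gymnasium-style interaction concrete, here is a minimal sketch of an episode loop. It does not use Agentick itself: `ToyGridEnv`, `run_episode`, and `always_right` are hypothetical stand-ins written only to mirror the Gymnasium `reset`/`step` signatures, with a seed-driven layout and an ASCII observation as loose analogues of the procedurally generated tasks and ASCII modality described above.

```python
import random


class ToyGridEnv:
    """Hypothetical stand-in task (NOT part of Agentick).

    It only mirrors the Gymnasium protocol:
    reset(seed) -> (obs, info); step(action) -> 5-tuple.
    """

    def __init__(self, size=5, max_steps=20):
        self.size = size
        self.max_steps = max_steps

    def reset(self, seed=None):
        # Procedural generation: the seed determines where the goal is placed.
        rng = random.Random(seed)
        self.agent = 0
        self.goal = rng.randrange(1, self.size)
        self.steps = 0
        return self._ascii(), {"goal": self.goal}

    def step(self, action):
        # Action 1 moves right; any other action moves left.
        delta = 1 if action == 1 else -1
        self.agent = min(self.size - 1, max(0, self.agent + delta))
        self.steps += 1
        terminated = self.agent == self.goal
        truncated = (not terminated) and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self._ascii(), reward, terminated, truncated, {}

    def _ascii(self):
        # ASCII observation: '.' empty cell, 'G' goal, 'A' agent.
        cells = ["."] * self.size
        cells[self.goal] = "G"
        cells[self.agent] = "A"
        return "".join(cells)


def run_episode(env, policy, seed):
    """Run one episode and return the summed reward."""
    obs, info = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


def always_right(obs):
    """Trivial policy: ignore the observation and always move right."""
    return 1
```

Because every agent type only needs to map an observation to an action, the same `run_episode` loop could in principle drive an RL policy, an LLM prompt wrapper, or a human keyboard handler; how Agentick actually wires these together is not specified in this summary.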
📝 Abstract
AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
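The abstract reports results as an oracle-normalized score but does not give the formula here. The sketch below shows one plausible reading, assuming each episode's agent return is divided by the oracle return on the same task and then averaged; the function names, the division-based normalization, and the per-episode averaging are all assumptions for illustration, not the paper's definition.

```python
from collections import defaultdict


def oracle_normalized(agent_return, oracle_return):
    """Assumed normalization: agent return as a fraction of the oracle's."""
    if oracle_return <= 0:
        raise ValueError("oracle_return must be positive for this sketch")
    return agent_return / oracle_return


def aggregate_scores(results):
    """Average normalized scores per category and overall.

    results: iterable of (category, agent_return, oracle_return) tuples.
    Returns (per_category_means, overall_mean), where the overall mean is
    taken over all entries (an assumption; a mean of category means is
    another reasonable choice).
    """
    per_category = defaultdict(list)
    for category, agent_return, oracle_return in results:
        per_category[category].append(
            oracle_normalized(agent_return, oracle_return)
        )
    category_means = {c: sum(v) / len(v) for c, v in per_category.items()}
    all_scores = [s for v in per_category.values() for s in v]
    return category_means, sum(all_scores) / len(all_scores)
```

Under this reading, a score of 0.309 would mean the agent recovers roughly 31% of the oracle's return on average, and per-category means make it possible for one agent to lead overall while another leads on planning or multi-agent tasks.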
Problem

Research questions and friction points this paper is trying to address.

sequential decision-making
AI agents
unified benchmark
general autonomous agents
agent evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

sequential decision-making
unified benchmark
foundation model agents
procedurally generated tasks
multi-modal observation