APEX-Agents

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study evaluates whether AI agents can carry out long-horizon, cross-application tasks in professional services: investment banking, management consulting, and corporate law. It introduces APEX-Agents, a benchmark of 480 tasks created by practitioners in those fields and scored against rubrics, together with Archipelago, the infrastructure used for agent execution and evaluation. Eight agents are benchmarked with the Pass@1 metric; Gemini 3 Flash (Thinking=High) scores highest at 24.0%, followed by GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro (all with Thinking=High). Both the benchmark (prompts, rubrics, gold outputs, files, and metadata) and Archipelago are open-sourced.

📝 Abstract
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open-source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.
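The abstract reports results as Pass@1 without spelling out the computation. The sketch below shows one common way to compute it: the chance that a single run of a task succeeds, averaged over tasks. It is an illustration, not the Archipelago implementation. The function names, the task IDs, and the rule that a run passes only when every rubric criterion is met are all assumptions, and the outcome data is made up.

```python
from typing import Mapping, Sequence


def run_passes(criteria_met: Sequence[bool]) -> bool:
    """Assumed pass rule: a run passes only if every rubric criterion is met."""
    return len(criteria_met) > 0 and all(criteria_met)


def pass_at_1(outcomes: Mapping[str, Sequence[bool]]) -> float:
    """Average, over tasks, of the fraction of runs that passed.

    With exactly one run per task this is the share of tasks solved on the
    first attempt; with several runs it estimates the same quantity.
    """
    if not outcomes:
        raise ValueError("no tasks to score")
    per_task = []
    for task_id, runs in outcomes.items():
        if not runs:
            raise ValueError(f"task {task_id!r} has no runs")
        per_task.append(sum(runs) / len(runs))
    return sum(per_task) / len(per_task)


# Hypothetical data: four tasks, two runs each, each run already reduced
# to pass/fail via run_passes().
example = {
    "task-a": [run_passes([True, True, True]), run_passes([True, False, True])],
    "task-b": [False, False],
    "task-c": [True, False],
    "task-d": [False, False],
}
score = pass_at_1(example)
print(f"Pass@1 = {score:.1%}")  # prints "Pass@1 = 25.0%"
```

Under this reading, a leaderboard score of 24.0% would mean that roughly one in four single attempts satisfies the full rubric.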
Problem

Research questions and friction points this paper is trying to address.

AI agents
long-horizon tasks
cross-application tasks
benchmark
realistic work environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI agents
long-horizon tasks
cross-application benchmark
realistic work environments
open-source evaluation infrastructure
Authors

Bertie Vidgen (Oxford, Mercor; interests: Evals, MCP + RAG, Alignment + Safety, Content Moderation)
Austin Mann
Abby Fennelly
Jonathan Stanly
Lucas Rothman
M. Burstein
Julien Benchek
David Ostrofsky
Anirudh Ravichandran
Debnil Sur
N. Venugopal (PES University; interests: Power Electronics, Image Processing, Video Processing)
A. Hsia
Isaac Robinson
Calix Huang
Olivia Varones
Daniyal Khan
Michael R. Haines
Zach Richards
Chirag Mahapatra
Brendan Foody
Osvald Nitski (Product Manager at Mercor; interests: Machine Learning, Natural Language Processing, Artificial Intelligence)