BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

📅 2026-04-13

🤖 AI Summary
Existing AI benchmarks lack the fidelity to evaluate real-world performance on high-value, domain-specific workflows. This work introduces BankerToolBench (BTB), an open-source, end-to-end investment banking benchmark developed with 502 investment bankers. Agents must navigate data rooms, use industry tools (a market data platform and an SEC filings database), and produce multi-file deliverables in Excel, PowerPoint, and PDF/Word. Deliverables are scored automatically against 100+ rubric criteria defined by veteran investment bankers. Across nine frontier models, even the best performer (GPT-5.4) fails nearly half of the criteria, and bankers rate 0% of its outputs as client-ready. The failure analysis points to obstacles such as breakdowns in cross-artifact consistency.

📝 Abstract
Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables--including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.
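The rubric-based scoring described in the abstract can be illustrated with a minimal sketch. It assumes each rubric criterion yields a binary pass/fail verdict and that client readiness is a separate yes/no rating per output. The class name, function names, and example criteria are hypothetical and are not taken from the BTB codebase.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    criterion: str  # hypothetical rubric criterion text
    passed: bool    # pass/fail verdict from a grader (assumed binary)


def rubric_pass_rate(results: list[CriterionResult]) -> float:
    """Fraction of rubric criteria that a deliverable satisfies."""
    if not results:
        raise ValueError("no criteria to score")
    return sum(r.passed for r in results) / len(results)


def client_ready_rate(ratings: list[bool]) -> float:
    """Share of task outputs that reviewers judged client-ready."""
    if not ratings:
        raise ValueError("no ratings to aggregate")
    return sum(ratings) / len(ratings)


# Illustrative criteria only; real BTB rubrics are written by bankers.
results = [
    CriterionResult("Revenue in the deck matches the Excel model", False),
    CriterionResult("Valuation uses the requested comparable set", True),
    CriterionResult("Every figure cites a source", True),
    CriterionResult("Deck follows the requested slide order", False),
]

print(f"rubric pass rate: {rubric_pass_rate(results):.0%}")
print(f"client-ready rate: {client_ready_rate([False, False, False]):.0%}")
```

The two metrics are kept separate because a deliverable can pass many individual criteria and still not be client-ready, which matches the pattern the abstract reports for the best-performing model.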
Problem

Research questions and friction points this paper is trying to address.

AI benchmarking
investment banking
professional workflows
ecological validity
agent evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI agent evaluation
investment banking workflows
ecologically valid benchmark
multi-artifact consistency
professional automation
Elaine Lau
McGill University, Mila, Scale AI
deep learning, reinforcement learning, natural language processing
Markus Dücker
Handshake AI
Ronak Chaudhary
Handshake AI
Hui Wen Goh
Handshake AI
Rosemary Wei
Handshake AI
Vaibhav Kumar
Handshake AI
Saed Qunbar
Handshake AI
Guram Gogia
Handshake AI
Yi Liu
Handshake AI
Scott Millslagle
Handshake AI
Nasim Borazjanizadeh
Handshake AI
Ulyana Tkachenko
Handshake AI
Samuel Eshun Danquah
Handshake AI
Collin Schweiker
Handshake AI
Vijay Karumathil
Handshake AI
Asrith Devalaraju
Handshake AI
Varsha Sandadi
Handshake AI
Haemi Nam
Handshake AI
Punit Arani
Handshake AI
Ray Epps
Handshake AI
Abdullah Arif
Handshake AI
Sahil Bhaiwala
Handshake AI
Curtis Northcutt
Handshake AI
Skyler Wang
Assistant Professor, McGill University | Research Scientist, Meta
AI, Technology, Human-Computer Interaction, Economic Sociology, Gender & Sexuality
Anish Athalye
Massachusetts Institute of Technology
Computer Systems, Programming Languages, Machine Learning
Jonas Mueller
Cleanlab
Trustworthy AI, Machine Learning, Statistics, Computational Biology
Francisco Guzmán
Handshake AI