Herculean: An Agentic Benchmark for Financial Intelligence

📅 2026-05-14
🤖 AI Summary
Existing evaluations of financial intelligence focus on static capabilities and do not assess whether AI agents can reliably execute high-stakes, real-world financial workflows. This work proposes the first benchmark for agentic financial intelligence, modeling four representative workflows (trading, hedging, market insight generation, and auditing) as standardized evaluation environments. Each workflow is built as an MCP-based skill environment that combines financial tools, interaction dynamics, constraints, and success criteria, enabling end-to-end assessment of heterogeneous agent systems. Experiments show that frontier AI agents perform relatively well on trading and market insight tasks but struggle substantially on hedging and auditing, exposing gaps in long-horizon coordination, state consistency, and structured verification.
📝 Abstract
As AI agents improve, the central question is no longer whether they can solve isolated, well-defined financial tasks, but whether they can reliably carry out professional financial work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skill-based benchmark for agentic financial intelligence, spanning four representative workflows: Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find that performance is relatively strong on Trading and Market Insights but substantially weaker on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents: turning financial reasoning into dependable execution of high-stakes financial workflows.
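To make the idea of a workflow environment "with its own tools, interaction dynamics, constraints, and success criteria" concrete, the following is a minimal sketch in plain Python. It is not the Herculean API and does not use any MCP library: the class `SkillEnvironment`, the tool names `get_exposure` and `place_hedge`, the toy transaction cost, and the scripted agent are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SkillEnvironment:
    """Hypothetical workflow environment: tools, constraints, success check."""
    name: str
    tools: Dict[str, Callable[..., object]]
    constraints: List[Callable[[dict], bool]]
    success: Callable[[dict], bool]
    state: dict = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def call(self, tool: str, **kwargs) -> object:
        """Run one tool call, then check every constraint on the new state."""
        if tool not in self.tools:
            raise KeyError(f"unknown tool: {tool}")
        result = self.tools[tool](self.state, **kwargs)
        self.log.append(tool)
        violated = [c.__name__ for c in self.constraints if not c(self.state)]
        if violated:
            raise RuntimeError(f"constraint violated: {violated}")
        return result

    def passed(self) -> bool:
        """Evaluate the end-of-episode success criterion."""
        return self.success(self.state)


def make_hedging_env() -> SkillEnvironment:
    """Toy hedging task (assumed): drive net exposure to zero without overdrawing cash."""
    def get_exposure(state):
        return state["exposure"]

    def place_hedge(state, size: float):
        state["exposure"] -= size
        state["cash"] -= abs(size) * 0.01  # assumed toy transaction cost
        return state["exposure"]

    def cash_non_negative(state):
        return state["cash"] >= 0

    def is_hedged(state):
        return abs(state["exposure"]) < 1e-9

    return SkillEnvironment(
        name="hedging",
        tools={"get_exposure": get_exposure, "place_hedge": place_hedge},
        constraints=[cash_non_negative],
        success=is_hedged,
        state={"exposure": 100.0, "cash": 5.0},
    )


def scripted_agent(env: SkillEnvironment) -> None:
    """Stand-in for an agent: read the exposure, then offset it in one trade."""
    exposure = env.call("get_exposure")
    env.call("place_hedge", size=exposure)


env = make_hedging_env()
scripted_agent(env)
print(env.passed(), env.log)
```

In this sketch the same `call` and `passed` interface would serve any workflow, which is one plausible way a benchmark could score heterogeneous agents consistently: the agent only sees tools, while constraints and the success criterion are checked by the environment.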
Problem

Research questions and friction points this paper is trying to address.

financial intelligence
agentic benchmark
workflow execution
AI agents
financial tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic benchmark
financial intelligence
workflow-based evaluation
MCP-based skill environment
end-to-end assessment
Xueqing Peng
Yale University
Zhuohan Xie
MBZUAI
Financial AI, Reasoning, Natural Language Processing, Computational Linguistics, Deep Learning
Yupeng Cao
Stevens Institute of Technology
Natural Language Processing, MultiModal, Trustworthy AI
Haohang Li
Stevens Institute of Technology
Mechanistic Interpretability, Language Model, LLM Agent, FinTech
Lingfei Qian
Yale University
Yan Wang
The Fin AI
Vincent Jim Zhang
The Fin AI
Huan He
The Fin AI
Xuguang Ai
Biomedical Informatics & Data Science, Yale University
AI in Healthcare, Data Science, NLP, Biomedical Informatics
Linhai Ma
Yale University
Deep learning, Medical signal/image analysis, Concurrency
Ruoyu Xiang
New York University
Yueru He
Columbia University
Finance, Large Language Models
Yi Han
Georgia Institute of Technology
Shuyao Wang
University of Tennessee
power electronic, power system, microgrid, renewable energy
Yuqing Guo
The Fin AI
Mingyang Jiang
Shanghai Jiao Tong University
robotics, intelligent vehicle, machine learning
Yilun Zhao
MBZUAI
Youzhong Dong
The Fin AI
Xiaoyu Wang
PhD Candidate, New York University
Federated Learning, Multimodal Learning, Edge Computing, Recommendation System
Yankai Chen
Postdoctoral Associate, Cornell University
Information Retrieval, Knowledge Mining, Large Language Models, Agentic AI
Ye Yuan
McGill University, Mila - Quebec AI Institute
Generative Modeling, Black Box Optimization, Knowledge-Centric NLP, LLMs
Qiyuan Zhang
City University of Hong Kong
NLP, Data Valuation, LLM for Evaluation (Judges and RM)
Fuyuan Lyu
McGill University / Mila - Quebec AI Institute
Data-Centric AI, Data Mining, LLM Evaluation, Inference Scaling
Haolun Wu
Researcher at Mila, McGill, Stanford | Prev. intern at Google, DeepMind, MSR
Knowledge Representation, Information Retrieval, Human-centric AI
Yonghan Yang
MBZUAI