🤖 AI Summary
Existing evaluations of financial intelligence focus on static capabilities and do not assess whether AI agents can reliably execute high-stakes, real-world financial workflows. This work proposes the first comprehensive benchmark for agentic financial intelligence, modeling four canonical workflows (trading, hedging, market insight generation, and auditing) as standardized evaluation environments. Built on the Model Context Protocol (MCP), the benchmark integrates diverse financial tools, dynamic interaction mechanisms, and structured validation criteria to enable end-to-end assessment of heterogeneous agent systems. Experimental results show that state-of-the-art AI agents perform relatively well on trading and market insight tasks but fall substantially short on hedging and auditing, where long-horizon coordination, state consistency, and structured verification are critical, revealing gaps in operational reliability.
📝 Abstract
As AI agents improve, the central question is no longer whether they can solve isolated, well-defined financial tasks, but whether they can reliably carry out professional financial work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skill-based benchmark for agentic financial intelligence, spanning four representative workflows: Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. We find that frontier agents perform relatively well on Trading and Market Insights but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents: turning financial reasoning into dependable execution of high-stakes financial workflows.