FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks struggle to evaluate the capability of large language models (LLMs) to perform authentic, high-stakes, knowledge-intensive tasks in finance, and they lack mechanisms for accountability. This work proposes the first long-horizon financial modeling benchmark grounded in real-world industry workflows, encompassing five core financial models and 25 complex tasks, each requiring on average over 18 hours of professional effort, designed and assessed by finance practitioners. Leveraging expert-driven, structured scoring rubrics and human performance baselines, the study systematically evaluates the quality and client-readiness of AI-generated outputs. Results show that even state-of-the-art LLMs significantly underperform human experts in both task execution and the practical usability of their deliverables.
📝 Abstract
As concerns surrounding AI-driven labor displacement intensify in knowledge-intensive sectors, existing benchmarks fail to measure performance on tasks that define practical professional expertise. Finance, in particular, has been identified as a domain with high AI exposure risk, yet lacks robust benchmarks to track real-world developments. This gap is compounded by the absence of clear accountability mechanisms in current Large Language Model (LLM) deployments. To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete. Developed with financial professionals, the benchmark reflects industry-standard financial modeling workflows and is paired with detailed rubrics for structured evaluation. We engage human experts to define the tasks, create rubrics, grade LLMs, and perform the tasks themselves as human baselines. We demonstrate that our human experts both receive higher scores on average and are more likely to provide client-ready outputs than current state-of-the-art systems.
Problem

Research questions and friction points this paper is trying to address.

AI benchmarking
financial modeling
long-horizon tasks
professional expertise
LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon benchmark
financial modeling
LLM evaluation
human expert baseline
professional AI accountability
Michael Krumdick
Kensho
Artificial Intelligence, Deep Learning
Varshini Reddy
Kensho Technologies
Machine Learning, Computer Vision, Reinforcement Learning, NLP, Data Science
Shivani Chaudhary
Kensho Technologies
William Day
Kensho Technologies
Maarij Ahmed
S&P Global
Hayan Haqqi
S&P Global
Muhammad Ahsen Fahim
S&P Global
Hanzallah Amjad
S&P Global
Ahmad Orakzai
S&P Global
Aqsa Gul
S&P Global
Chris Tanner
Head of R&D at Kensho; Lecturer at MIT
Natural Language Processing, Machine Learning