Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

πŸ“… 2026-02-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitations of existing financial large language model (LLM) evaluation benchmarks, which fail to capture the complex reasoning capabilities required of professional analysts across multiple documents, entities, and time dimensions, and lack fine-grained attribution of error sources. To bridge this gap, we introduce Fin-RATEβ€”the first benchmark grounded in U.S. SEC filings that simulates real-world financial analysis workflows. It encompasses three task types: fine-grained single-document reasoning, cross-entity thematic comparison, and longitudinal firm tracking. We systematically evaluate 17 prominent models under both retrieved and given-context settings and, for the first time, categorize and quantify errors arising from retrieval, generation, financial reasoning, and contextual understanding. Experiments reveal accuracy drops of 18.60% and 14.35% on longitudinal and cross-entity tasks, respectively, primarily due to comparative hallucinations, temporal misalignment, and entity mismatches, thereby filling a critical void in evaluating complex financial reasoning.

πŸ“ Abstract
As Large Language Models (LLMs) are increasingly deployed in the finance domain, they are expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis that requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings and mirroring financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis, respectively. This degradation is driven by increased comparison hallucinations and temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency, limitations that existing benchmarks have yet to formally categorize or quantify.
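The reported degradation figures are percentage-point accuracy drops relative to the single-document task. A minimal sketch of how such per-task accuracy and drop numbers could be aggregated is below; the task names, error labels, and record format are hypothetical illustrations, not the paper's actual harness:

```python
from collections import Counter

# Hypothetical task types and error categories, loosely mirroring the
# three Fin-RATE pathways and four error sources described above.
TASKS = ("single_document", "cross_entity", "longitudinal")
ERROR_TYPES = ("retrieval", "generation", "financial_reasoning", "context_understanding")


def accuracy_by_task(records):
    """records: iterable of (task, is_correct, error_type_or_None) tuples."""
    correct, total = Counter(), Counter()
    for task, is_correct, _error in records:
        total[task] += 1
        correct[task] += int(is_correct)
    return {task: correct[task] / total[task] for task in total}


def accuracy_drop(acc, baseline="single_document"):
    """Percentage-point drop of each task's accuracy vs. the baseline task."""
    base = acc[baseline]
    return {task: round((base - a) * 100, 2) for task, a in acc.items() if task != baseline}
```

Under this scheme, an 18.60-point longitudinal drop simply means longitudinal accuracy is 0.1860 lower than single-document accuracy on the same model.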
Problem

Research questions and friction points this paper is trying to address.

financial analytics
LLM evaluation
SEC filings
cross-entity comparison
longitudinal tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

financial benchmark
SEC filings
cross-entity comparison
longitudinal tracking
LLM evaluation
πŸ”Ž Similar Papers
No similar papers found.