FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical gaps in evaluating Large Reasoning Models (LRMs) on financial numerical reasoning: low result verifiability, incomplete coverage of domain concepts, and insufficient task difficulty. To this end, the authors introduce FinanceReasoning, a credible, comprehensive, and challenging financial numerical reasoning benchmark for LRMs. The benchmark updates 15.6% of questions from four public datasets, annotates 908 new questions with detailed Python solutions and rigorously refined evaluation standards, covers 67.8% of financial concepts and formulas, and includes 238 Hard problems that require applying multiple formulas in combination. A library of 3,133 Python-formatted financial functions injects refined knowledge into models, lifting GPT-4o from 83.2% to 91.6%. On the Hard problems, the best-performing model, OpenAI o1 with Program-of-Thought (PoT), reaches 89.1% accuracy, and combining Reasoner and Programmer models raises DeepSeek-R1 from 83.2% to 87.8%. The analysis further uncovers persistent numerical precision bottlenecks.

📝 Abstract
We introduce FinanceReasoning, a novel benchmark designed to evaluate the reasoning capabilities of large reasoning models (LRMs) in financial numerical reasoning problems. Compared to existing benchmarks, our work provides three key advancements. (1) Credibility: We update 15.6% of the questions from four public datasets, annotating 908 new questions with detailed Python solutions and rigorously refining evaluation standards. This enables an accurate assessment of the reasoning improvements of LRMs. (2) Comprehensiveness: FinanceReasoning covers 67.8% of financial concepts and formulas, significantly surpassing existing datasets. Additionally, we construct 3,133 Python-formatted functions, which enhances LRMs' financial reasoning capabilities through refined knowledge (e.g., 83.2% $\rightarrow$ 91.6% for GPT-4o). (3) Challenge: Models are required to apply multiple financial formulas for precise numerical reasoning on 238 Hard problems. The best-performing model (i.e., OpenAI o1 with PoT) achieves 89.1% accuracy, yet LRMs still face challenges in numerical precision. We demonstrate that combining Reasoner and Programmer models can effectively enhance LRMs' performance (e.g., 83.2% $\rightarrow$ 87.8% for DeepSeek-R1). Our work paves the way for future research on evaluating and improving LRMs in domain-specific complex reasoning tasks.
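The "Python-formatted functions" the abstract describes are executable encodings of financial formulas used to inject refined knowledge into LRMs. As a rough illustration only — the function name, signature, and formula choice below are assumptions, not drawn from the actual dataset — such a function might look like:

```python
def future_value(present_value: float, annual_rate: float, years: int,
                 compounding_per_year: int = 1) -> float:
    """Compound-interest future value: FV = PV * (1 + r/m) ** (m * n).

    Illustrative sketch of a Python-formatted financial formula;
    the real FinanceReasoning functions are not shown in the abstract.
    """
    periodic_rate = annual_rate / compounding_per_year
    periods = compounding_per_year * years
    return present_value * (1 + periodic_rate) ** periods

# Usage: $1,000 at 5% annual interest, compounded monthly for 10 years
fv = future_value(1000.0, 0.05, 10, compounding_per_year=12)
print(round(fv, 2))  # 1647.01
```

Encoding each formula as an executable function makes solutions deterministically checkable, which is what enables the "rigorous, deterministic scoring" the summary refers to.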
Problem

Research questions and friction points this paper is trying to address.

Evaluating financial numerical reasoning in large models
Enhancing financial concept coverage and formula accuracy
Improving model performance on complex financial problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Updated questions with Python solutions for credibility
Covered 67.8% financial concepts for comprehensiveness
Required multiple financial formulas for challenge
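The Hard split requires chaining several financial formulas within one solution. A minimal Program-of-Thought-style sketch of such multi-formula composition — the functions and numbers here are illustrative assumptions, not taken from the benchmark — might combine a discount-factor formula with a net-present-value calculation:

```python
def discount_factor(rate: float, period: int) -> float:
    """DF_t = 1 / (1 + r) ** t."""
    return 1.0 / (1.0 + rate) ** period

def net_present_value(rate: float, cash_flows: list[float]) -> float:
    """NPV = sum over t of CF_t * DF_t, with CF_0 occurring at t = 0."""
    return sum(cf * discount_factor(rate, t)
               for t, cf in enumerate(cash_flows))

# Usage: initial outlay of 100, then three inflows of 40, at a 10% rate
npv = net_present_value(0.10, [-100.0, 40.0, 40.0, 40.0])
print(round(npv, 2))  # -0.53
```

Problems of this shape stress exactly the failure mode the paper highlights: each intermediate formula must be numerically precise, since rounding errors compound across the chained steps.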
👥 Authors
Zichen Tang (Beijing University of Posts and Telecommunications)
Ziyan Ma (Beijing University of Posts and Telecommunications)
Haoyang He (Beijing University of Posts and Telecommunications)
Jiacheng Liu (Beijing University of Posts and Telecommunications)
Zhongjun Yang (Beijing University of Posts and Telecommunications)
Zihua Rong (Beijing University of Posts and Telecommunications)
Rongjin Li (Xiamen University, VoiceAI)
Kun Ji (Beijing University of Posts and Telecommunications)
Qing Huang (Chinese Academy of Sciences)
Xinyang Hu (Beijing University of Posts and Telecommunications)
Yang Liu (Beijing University of Posts and Telecommunications)
Qianhe Zheng (Beijing University of Posts and Telecommunications)