BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
The financial domain lacks high-precision, scenario-specific evaluation benchmarks for large language models (LLMs). Method: This paper introduces BizFinBench—the first Chinese LLM benchmark grounded in authentic financial scenarios—covering five dimensions (numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based QA) and nine fine-grained tasks, with 6,781 human-annotated samples and a dual subjective-objective evaluation framework. It also proposes IteraJudge, an iterative LLM self-evaluation method that mitigates judge bias, establishing a business-aligned, multi-dimensional, fine-grained, real-context-driven evaluation paradigm. Contribution/Results: Evaluation across 25 mainstream models reveals systematic weaknesses in complex cross-concept reasoning. DeepSeek-R1 achieves top performance in numerical calculation (64.04) and information extraction (71.46); ChatGPT-o3 leads in reasoning (83.58); open-source models trail closed-source counterparts by up to 19.49 points.
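The paper summary describes IteraJudge only at a high level (an iterative loop in which an LLM judge revisits its own verdict to reduce bias). As a rough illustration only — not the paper's actual algorithm — such a loop might be sketched as below; the function names, the round count, and the stub judge are all assumptions for this sketch:

```python
# Hypothetical sketch of an iterative LLM-as-judge loop in the spirit of
# IteraJudge. All names here (iterative_judge, judge_fn, stub_judge) are
# illustrative assumptions, not the paper's API.

def iterative_judge(answer, reference, judge_fn, rounds=3):
    """Score `answer` against `reference`, letting the judge revise its
    own verdict over several rounds to damp first-pass bias."""
    critique = ""
    score = None
    for _ in range(rounds):
        # Each round, the judge sees its previous critique and may
        # adjust its score toward a more stable verdict.
        score, critique = judge_fn(answer, reference, prior_critique=critique)
    return score

def stub_judge(answer, reference, prior_critique=""):
    """Deterministic stand-in for an LLM call: moves its score halfway
    toward a fixed target (0.8) each round, mimicking self-correction."""
    prev = float(prior_critique) if prior_critique else 0.0
    new = prev + 0.5 * (0.8 - prev)
    return new, str(new)
```

With the stub judge, three rounds converge from 0.0 toward 0.8 (0.4, then 0.6, then 0.7), illustrating how repeated self-review can smooth out an initial verdict; a real implementation would replace `stub_judge` with prompted LLM calls.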

📝 Abstract
Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM reliability in logic-heavy financial applications
Assessing performance gaps in financial tasks across diverse LLMs
Reducing evaluation bias in objective financial benchmarking metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

BizFinBench: first LLM benchmark grounded in real-world financial applications
IteraJudge method reduces LLM evaluation bias
Evaluates 25 models across five financial dimensions