FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing financial reasoning benchmarks (e.g., FinQA, ConvFinQA) supervise only final answers and offer no verifiable evaluation of multi-step symbolic reasoning, hindering downstream model improvement. Method: The paper introduces FinChain, the first symbolic benchmark for verifiable chain-of-thought financial reasoning. It spans 12 domains and 54 topics, each with five parameterized symbolic templates paired with executable Python traces, enabling automated data generation and easy adaptation to new domains. It also proposes ChainEval, an evaluation metric that jointly scores final-answer accuracy and the logical consistency of intermediate reasoning steps. Contribution/Results: Benchmarking 30 mainstream large language models reveals pervasive intermediate reasoning errors, even in state-of-the-art models. All templates, code, and evaluation tools are publicly released.

📝 Abstract
Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
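The abstract's notion of a parameterized template with an executable Python trace can be sketched as follows. This is a hypothetical illustration, not FinChain's actual template code: the function name, the compound-interest topic, and the step naming scheme are assumptions; the idea is only that sampled parameters yield both a question and a verifiable sequence of intermediate values.

```python
import random

def compound_interest_template(seed=None):
    """Hypothetical FinChain-style parameterized template: sample the
    symbolic parameters, then emit a question plus an executable trace
    whose intermediate values can be checked step by step."""
    rng = random.Random(seed)
    principal = rng.randint(1, 100) * 1000        # P, in dollars
    rate = rng.choice([0.02, 0.03, 0.05, 0.08])   # annual rate r
    years = rng.randint(1, 10)                    # horizon n

    # Executable trace: each intermediate quantity is a named, checkable step.
    steps = []
    growth = (1 + rate) ** years
    steps.append(("growth_factor", growth))       # step 1: (1 + r)^n
    final_value = principal * growth
    steps.append(("final_value", final_value))    # step 2: P * (1 + r)^n

    question = (f"An investment of ${principal} grows at {rate:.0%} per year. "
                f"What is it worth after {years} years?")
    return question, steps

question, steps = compound_interest_template(seed=0)
```

Because the trace is generated by code rather than annotated by hand, arbitrarily many instances per topic can be produced, and the same pattern transfers to other domains by swapping the formula.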
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarks for evaluating multi-step financial reasoning
Existing datasets do not assess intermediate reasoning steps
Need for verifiable Chain-of-Thought in financial tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Symbolic benchmark for verifiable financial reasoning
Parameterized templates for diverse reasoning complexity
Automatic evaluation with executable Python traces
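The evaluation idea behind the bullets above, scoring intermediate steps against the gold executable trace rather than only the final answer, can be sketched roughly as follows. This is a minimal assumed sketch, not ChainEval's published scoring code: the function name, tolerance scheme, and positional step alignment are all simplifying assumptions.

```python
def chain_eval(gold_steps, pred_steps, rel_tol=1e-3):
    """Hypothetical ChainEval-style scorer: compare a model's predicted
    intermediate values against the gold trace and report both step-level
    consistency and final-answer correctness."""
    def close(a, b):
        # relative tolerance, with an absolute floor for values near zero
        return abs(a - b) <= rel_tol * max(1.0, abs(b))

    matched = sum(1 for g, p in zip(gold_steps, pred_steps) if close(p, g))
    step_acc = matched / len(gold_steps) if gold_steps else 0.0
    final_ok = bool(pred_steps) and close(pred_steps[-1], gold_steps[-1])
    return {"step_accuracy": step_acc, "final_answer_correct": final_ok}

# A prediction that slips on step 1 but still lands the final answer
scores = chain_eval([1.1025, 1102.5], [1.10, 1102.5])
```

The point of scoring this way is precisely the paper's finding: a model can reach the right final number while making intermediate reasoning errors, which answer-only metrics cannot detect.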