Agentar-DeepFinance-300K: A Large-Scale Financial Dataset via Systematic Chain-of-Thought Synthesis Optimization

📅 2025-07-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing chain-of-thought (CoT) synthesis methods for financial reasoning suffer from shallow sampling, narrow knowledge coverage, and insufficient systematic evaluation. To address these limitations, the paper presents Agentar-DeepFinance-300K, a large-scale financial reasoning dataset built with a CoT synthesis pipeline that combines Multi-perspective Knowledge Extraction (MKE) with Self-Corrective Rewriting (SCR) to generate deep, knowledge-comprehensive reasoning trajectories. A systematic investigation, termed CoT Cube, analyzes the factors that govern CoT effectiveness in finance, such as necessity, length, and synthesizer. Models trained on the dataset achieve significant improvements across multiple financial benchmarks (average +5.2% accuracy). The dataset is publicly released to advance research on deep reasoning in finance.

📝 Abstract

Recent advancements in large language models (LLMs) have demonstrated remarkable general reasoning capabilities, holding significant potential for applications in the financial domain, a field that requires robust and reliable reasoning. It has been demonstrated that distilling high-quality chain-of-thought (CoT) rationales from advanced general reasoning models offers a promising and efficient path to financial reasoning models. However, existing CoT synthesis methods suffer from shallow CoT sampling, leaving unexplored the question of how to construct a well-designed knowledge space for financial reasoning. In this paper, we present Agentar-DeepFinance-300K, a large-scale financial reasoning dataset characterized by its systematic CoT synthesis optimization. We first introduce a comprehensive CoT synthesis pipeline featuring Multi-perspective Knowledge Extraction (MKE) and Self-Corrective Rewriting (SCR) to generate exhaustive and deep financial reasoning trajectories. Furthermore, a systematic investigation, termed CoT Cube, is conducted to analyze critical factors that influence CoT effectiveness, such as necessity, length, and synthesizer, yielding valuable insights for high-quality financial CoT construction. Experiments demonstrate that models trained on Agentar-DeepFinance-300K achieve significant improvements on financial benchmarks. We publicly release Agentar-DeepFinance-300K, hoping to advance research on financial reasoning models.
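
One way to picture the CoT Cube investigation is as a grid of ablation settings, one per combination of factor levels. The sketch below is only an illustration: the abstract names the three axes (necessity, length, synthesizer) but not their levels, so every level name here is a hypothetical placeholder, and `cube_configs` is not a function from the paper.

```python
from itertools import product

# Axes named in the abstract; the levels are hypothetical placeholders.
AXES = {
    "necessity": ["with_cot", "without_cot"],
    "length": ["short", "medium", "long"],
    "synthesizer": ["model_a", "model_b"],
}


def cube_configs(axes):
    """Enumerate every combination of axis levels as a dict."""
    names = list(axes)
    return [
        dict(zip(names, levels))
        for levels in product(*(axes[name] for name in names))
    ]


configs = cube_configs(AXES)
print(len(configs))  # 2 * 3 * 2 = 12 settings
print(configs[0])
```

Each dict would correspond to one training run whose benchmark score can then be compared along a single axis while the other two are held fixed.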

Problem

Research questions and friction points this paper is trying to address.

Enhancing financial reasoning models via systematic CoT synthesis optimization

Addressing shallow CoT sampling in existing financial reasoning methods

Constructing a comprehensive knowledge space for financial reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic CoT synthesis optimization for finance

Multi-perspective Knowledge Extraction pipeline

Self-Corrective Rewriting for deep reasoning
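
The two pipeline components listed above can be sketched as a generic extract, draft, and rewrite loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not say how MKE selects perspectives or how SCR decides that a trajectory needs rewriting, so the reference-answer check, the `max_rounds` limit, and every function name below are hypothetical.

```python
from typing import Callable, List


def synthesize_cot(
    question: str,
    reference_answer: str,
    perspectives: List[Callable[[str], str]],
    generate: Callable[[str, List[str]], str],
    extract_answer: Callable[[str], str],
    rewrite: Callable[[str, str], str],
    max_rounds: int = 2,
) -> str:
    # MKE-like step (assumed): one knowledge note per perspective.
    notes = [perspective(question) for perspective in perspectives]
    # Draft a reasoning trajectory conditioned on the collected notes.
    cot = generate(question, notes)
    # SCR-like step (assumed): rewrite while the final answer is wrong.
    for _ in range(max_rounds):
        if extract_answer(cot) == reference_answer:
            break
        cot = rewrite(question, cot)
    return cot


# Toy stand-ins for model calls, so the sketch runs without any LLM.
perspectives = [
    lambda q: "concept: simple interest",
    lambda q: "formula: I = P * r * t",
]
generate = lambda q, notes: " | ".join(notes) + " | answer: 40"
extract_answer = lambda cot: cot.rsplit("answer: ", 1)[-1]
rewrite = lambda q, cot: cot.rsplit("answer: ", 1)[0] + "answer: 50"

result = synthesize_cot(
    "Interest on 1000 at 5% for 1 year?", "50",
    perspectives, generate, extract_answer, rewrite,
)
print(result)
```

In a real setting the callables would wrap LLM prompts; passing them in as parameters keeps the control flow testable independently of any model.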

🔎 Similar Papers

Global Neural Networks and The Data Scaling Effect in Financial Time Series Forecasting