🤖 AI Summary
Financial large language models (LLMs) suffer from insufficient domain knowledge and weak structured reasoning, limiting their performance on complex financial tasks. To address this, we propose FEVO, a multi-stage enhancement framework built around a novel three-tier training paradigm: knowledge expansion, task alignment, and reasoning evolution. First, domain-specific knowledge is injected via continual pretraining (CPT) on financial corpora; second, supervised fine-tuning (SFT) aligns model behavior with task objectives; third, high-quality reasoning data (FEVO-Train) is curated using rule-based filtering and state-of-the-art reasoning models, and reinforcement learning (RL) then optimizes the model's logical reasoning paths. Experiments show that FEVO-R32B consistently outperforms same-scale general-purpose LLMs, larger baseline models, and purely RL-trained variants across five financial benchmarks. These results validate the effectiveness of the decoupled design, which enhances domain knowledge and reasoning capability separately.
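To make the three-stage pipeline concrete, here is a minimal sketch of how the stages compose. This is an illustrative assumption, not the authors' training code: all function names and signatures are hypothetical, and the mapping of the C32B/S32B/R32B checkpoints to the CPT/SFT/RL stages is inferred from the naming.

```python
# Hypothetical sketch of FEVO's three-stage pipeline (CPT -> SFT -> RL).
# Stage names follow the summary above; the function bodies are placeholders,
# not the paper's actual training procedure.

from dataclasses import dataclass


@dataclass
class Model:
    name: str


def continual_pretrain(model: Model, financial_corpus: list[str]) -> Model:
    """Stage 1 (CPT): inject financial domain knowledge from raw corpora."""
    return Model(name=model.name + "+CPT")  # stand-in for a real CPT run


def supervised_finetune(model: Model, reasoning_traces: list[str]) -> Model:
    """Stage 2 (SFT): instill structured, elaborate reasoning patterns."""
    return Model(name=model.name + "+SFT")


def reinforcement_learn(model: Model, verifiable_tasks: list[str]) -> Model:
    """Stage 3 (RL): integrate domain knowledge with learned reasoning."""
    return Model(name=model.name + "+RL")


base = Model("Qwen2.5-32B")
c32b = continual_pretrain(base, financial_corpus=["..."])   # -> FEVO-C32B (assumed)
s32b = supervised_finetune(c32b, reasoning_traces=["..."])  # -> FEVO-S32B (assumed)
r32b = reinforcement_learn(s32b, verifiable_tasks=["..."])  # -> FEVO-R32B (assumed)
print(r32b.name)  # Qwen2.5-32B+CPT+SFT+RL
```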
📝 Abstract
Advancements in reasoning for large language models (LLMs) have led to significant performance improvements in various fields such as mathematics and programming. However, research applying these advances to the financial domain, where considerable domain-specific knowledge is necessary to complete tasks, remains limited. To address this gap, we introduce FEVO (Financial Evolution), a multi-stage enhancement framework developed to improve LLM performance in the financial domain. FEVO systematically enhances LLM performance by using continued pre-training (CPT) to expand financial domain knowledge, supervised fine-tuning (SFT) to instill structured, elaborate reasoning patterns, and reinforcement learning (RL) to further integrate the expanded financial domain knowledge with the learned structured reasoning. To ensure effective and efficient training, we leverage frontier reasoning models and rule-based filtering to curate FEVO-Train, a set of high-quality datasets specifically designed for the different post-training phases. Using our framework, we train the FEVO series of models -- C32B, S32B, R32B -- from Qwen2.5-32B and evaluate them on seven benchmarks covering financial and general capabilities. Results show that FEVO-R32B achieves state-of-the-art performance on five financial benchmarks against much larger models as well as specialist models. More significantly, FEVO-R32B markedly outperforms FEVO-R32B-0 (trained from Qwen2.5-32B-Instruct using only RL), validating the effectiveness of financial domain knowledge expansion and structured, logical reasoning distillation.
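The abstract mentions rule-based filtering for curating FEVO-Train but does not list the rules. The sketch below shows what such a filter might look like; the specific criteria (a checkable final answer, trace-length bounds, question-level deduplication) and all names are assumptions for illustration, not the paper's actual curation rules.

```python
# Hypothetical rule-based filter for curating reasoning data like FEVO-Train.
# The rules below are illustrative assumptions, not the paper's criteria.

def passes_rules(sample: dict, seen: set[str]) -> bool:
    question, trace, answer = sample["question"], sample["trace"], sample["answer"]
    if not answer.strip():                 # drop samples without a checkable answer
        return False
    if not (50 <= len(trace) <= 20_000):   # drop degenerate or runaway traces
        return False
    if question in seen:                   # deduplicate by question text
        return False
    seen.add(question)
    return True


raw = [
    {"question": "What is the P/E ratio if price=50 and EPS=2?",
     "trace": "P/E = price / EPS = 50 / 2 = 25. " * 5, "answer": "25"},
    {"question": "Explain duration.", "trace": "short", "answer": ""},
]
seen: set[str] = set()
curated = [s for s in raw if passes_rules(s, seen)]
print(len(curated))  # 1: the second sample is dropped (no checkable answer)
```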