🤖 AI Summary
Existing process reward models (PRMs) underperform in finance because the domain demands highly structured, symbolic reasoning along with stringent factual accuracy and regulatory compliance. To address this, we propose Fin-PRM, the first PRM designed specifically for financial applications, which introduces a joint step-level and trajectory-level reward mechanism. We further develop a trajectory-aware offline/online reward learning framework that supports three deployment paradigms: reward distillation, reinforcement learning, and test-time inference optimization. Fin-PRM explicitly incorporates financial logic constraints into its reward modeling to improve the quality and fidelity of reasoning paths. Evaluated on the CFLUE and FinQA benchmarks, Fin-PRM achieves substantial improvements over general-purpose PRMs and domain-specific baselines: +12.9% in supervised learning, +5.2% in reinforcement learning, and +5.1% in test-time optimization, marking the first successful specialization of PRMs to the financial domain.
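The summary does not spell out how the step-level and trajectory-level signals are combined. The sketch below assumes a simple convex combination with an illustrative mixing weight `alpha`; the function name, the averaging of step scores, and the weight are all hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a joint step/trajectory reward (assumed form,
# not the paper's exact aggregation rule).
from typing import List


def joint_reward(step_rewards: List[float], trajectory_reward: float,
                 alpha: float = 0.5) -> float:
    """Combine per-step PRM scores with a trajectory-level score.

    step_rewards: PRM score in [0, 1] for each intermediate reasoning step.
    trajectory_reward: a single score for the whole reasoning trace.
    alpha: mixing weight between the two signals (illustrative choice).
    """
    step_component = sum(step_rewards) / len(step_rewards)  # mean step quality
    return alpha * step_component + (1.0 - alpha) * trajectory_reward


# Example: a trace with one weak intermediate step is still penalized
# even when the trajectory-level score looks good.
print(joint_reward([0.9, 0.4, 0.8], trajectory_reward=0.95))  # ~0.825
```

Under this assumed formulation, a reasoning trace cannot score highly on the trajectory signal alone: flawed intermediate steps pull the joint reward down, which is the behavior the step-level supervision is meant to enforce.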
📝 Abstract
Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce **Fin-PRM**, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over their baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
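As a concrete illustration of application (iii), the sketch below shows generic reward-informed Best-of-N selection. The `score_trajectory` interface stands in for Fin-PRM and is an assumption for illustration, not the released API.

```python
# Minimal sketch of reward-informed Best-of-N inference (application iii).
# `score_trajectory` is a placeholder for a PRM scorer such as Fin-PRM;
# its interface here is assumed, not taken from the project's code.
from typing import Callable, List


def best_of_n(candidates: List[str],
              score_trajectory: Callable[[str], float]) -> str:
    """Return the candidate reasoning trace with the highest PRM score."""
    return max(candidates, key=score_trajectory)


# Toy usage: in practice the scores would come from the reward model,
# and `candidates` would be N traces sampled from the policy model.
toy_scores = {"trace A": 0.62, "trace B": 0.91, "trace C": 0.47}
best = best_of_n(list(toy_scores), score_trajectory=lambda t: toy_scores[t])
print(best)  # -> "trace B"
```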