🤖 AI Summary
This work addresses the limited evaluation of large language models (LLMs) in financial quantitative tasks, which has predominantly focused on knowledge-based question answering and fails to capture genuine quantitative reasoning and strategy implementation capabilities. To bridge this gap, we propose QuantEval, a comprehensive benchmark that systematically assesses models across three dimensions: financial knowledge, quantitative mathematical reasoning, and strategy coding. For the first time, QuantEval integrates an executable CTA-style backtesting framework and standard financial performance metrics into the evaluation pipeline, enabling realistic and reproducible assessment of quantitative skills. Leveraging a deterministic backtesting environment—complete with a defined asset universe, transaction costs, and standardized metrics—we train models via supervised fine-tuning and reinforcement learning on domain-specific data. Experiments reveal that current state-of-the-art models significantly underperform human experts in reasoning and strategy generation, yet our approach markedly improves their performance on QuantEval.
📝 Abstract
Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate a range of state-of-the-art open-source and proprietary LLMs and observe substantial gaps to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs' quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.
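To make the evaluation pipeline concrete, the core of a deterministic backtest like the one described above can be sketched in a few lines. This is an illustrative simplification, not QuantEval's actual framework: the function names (`backtest`, `sharpe_ratio`, `max_drawdown`), the proportional cost model, and the +1/0/−1 position convention are all assumptions made for the example.

```python
import numpy as np

def backtest(prices, positions, cost_rate=0.0005):
    """Minimal deterministic backtest (illustrative, not QuantEval's framework).

    prices:    asset price per bar.
    positions: target position (+1 long, -1 short, 0 flat) held over each bar.
    cost_rate: proportional transaction cost charged on position changes.
    Returns the per-bar net return series.
    """
    prices = np.asarray(prices, dtype=float)
    positions = np.asarray(positions, dtype=float)
    asset_returns = np.diff(prices) / prices[:-1]            # simple return per bar
    gross = positions[:-1] * asset_returns                   # P&L of the held position
    turnover = np.abs(np.diff(positions, prepend=0.0))[:-1]  # traded position changes
    return gross - cost_rate * turnover                      # net of transaction costs

def sharpe_ratio(net_returns, bars_per_year=252):
    """Annualized Sharpe ratio of a per-bar return series (zero risk-free rate)."""
    sd = net_returns.std(ddof=1)
    return float(np.sqrt(bars_per_year) * net_returns.mean() / sd) if sd > 0 else 0.0

def max_drawdown(net_returns):
    """Largest peak-to-trough decline of the cumulative equity curve."""
    equity = np.cumprod(1.0 + net_returns)
    peak = np.maximum.accumulate(equity)
    return float((equity / peak - 1.0).min())
```

Because all inputs (price series, positions, cost rate) and metric definitions are fixed, the same model-generated strategy always yields the same score, which is what makes this style of evaluation reproducible.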