BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses two obstacles to large language model (LLM)-driven automated quantitative research: high technical barriers and the absence of a standardized benchmark for backtesting. To bridge this gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting, comprising four task categories and 18,246 annotated samples. We further propose AutoBacktest, a multi-agent system that integrates a Summarizer, a Retriever, and a Coder to generate reproducible backtesting code end to end from natural language strategy descriptions. Using real market data, SQL queries, and a Python-based backtesting engine, we systematically evaluate 23 mainstream LLMs, identify key factors that influence performance, and demonstrate the feasibility and effectiveness of LLM-powered backtesting automation.

📝 Abstract

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. Large Language Models (LLMs) offer a transformative path to automating this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning. In practice, however, progress is held back by the lack of a large-scale benchmark dedicated to automated quantitative backtesting. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation of 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.
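To make the Summarizer, Retriever, and Coder flow concrete, here is a minimal sketch of how the three stages might hand data to one another. It is not the AutoBacktest implementation: the `StrategySpec` type, the `prices` table schema, the toy description format, the moving-average rule, and the price data are all assumptions made up for illustration, and the two LLM-driven stages are replaced by plain Python stand-ins.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class StrategySpec:
    """Hypothetical structured output of the Summarizer stage."""
    ticker: str
    window: int  # moving-average lookback, in days


def summarize(description: str) -> StrategySpec:
    """Stand-in for the Summarizer; a real system would call an LLM here.

    Assumes a toy format such as "AAA 3-day moving average".
    """
    words = description.split()
    return StrategySpec(ticker=words[0], window=int(words[1].split("-")[0]))


def retrieve(conn: sqlite3.Connection, spec: StrategySpec) -> list[float]:
    """Stand-in for the Retriever: runs a parameterized SQL query."""
    sql = "SELECT close FROM prices WHERE ticker = ? ORDER BY day"
    return [row[0] for row in conn.execute(sql, (spec.ticker,))]


def backtest(closes: list[float], window: int) -> float:
    """Stand-in for code the Coder might emit.

    Holds the asset for one day whenever the close is above the mean of
    the previous `window` closes, and returns the total return.
    """
    equity = 1.0
    for t in range(window, len(closes) - 1):
        trailing_mean = sum(closes[t - window:t]) / window
        if closes[t] > trailing_mean:  # signal uses data up to day t only
            equity *= closes[t + 1] / closes[t]  # earn the next day's return
    return equity - 1.0


def build_demo_db() -> sqlite3.Connection:
    """Creates an in-memory table of synthetic prices (not real market data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (ticker TEXT, day INTEGER, close REAL)")
    closes = [10.0, 11.0, 12.0, 11.0, 12.0, 13.0]
    conn.executemany(
        "INSERT INTO prices VALUES (?, ?, ?)",
        [("AAA", day, close) for day, close in enumerate(closes)],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    spec = summarize("AAA 3-day moving average")
    total_return = backtest(retrieve(conn, spec), spec.window)
    print(f"{spec.ticker} total return: {total_return:.4f}")
```

Keeping each stage behind a small typed interface means the SQL and the backtest logic can be checked independently of whichever model produced them, which is in the spirit of the grounded verification the abstract emphasizes.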

Problem

Research questions and friction points this paper is trying to address.

quantitative backtesting

Large Language Models

benchmark

automated trading strategies

code generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

BacktestBench

automated quantitative backtesting

Large Language Models