SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods struggle to comprehensively assess large language models' (LLMs') ability to translate natural-language trading strategies into executable code, and in particular lack multidimensional metrics for system auditability, rule drift, and robustness. This work proposes SysTradeBench, an iterative build-test-patch benchmark that introduces, for the first time, drift-aware diagnostics and a multidimensional scoring framework covering specification fidelity, risk discipline, reliability, and out-of-sample robustness. The framework integrates a sandboxed testing harness, rule-drift detection across iterations, evidence-bundle feedback, and a constrained repair process under frozen semantics, emphasizing human-AI collaborative governance. Experiments across 17 models and 12 strategies show that top models achieve validity above 91.7%, while evidence-driven iteration also induces code convergence by the second iteration. These results highlight LLMs' strength in rapid prototyping and shallow repairs, but underscore the necessity of human oversight for critical strategies that require solution diversity.
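The rule-drift detection the summary describes can be pictured as comparing canonical representations of a strategy's rules across iterations. The sketch below is illustrative only and not the paper's actual algorithm; `rule_fingerprint`, `detect_rule_drift`, and the example rule dictionaries are assumed placeholders.

```python
import hashlib
import json

def rule_fingerprint(rules: dict) -> str:
    """Canonicalize a rule set and hash it, so semantically
    identical rule sets map to the same fingerprint."""
    canonical = json.dumps(rules, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def detect_rule_drift(prev_rules: dict, next_rules: dict) -> list:
    """Return the rule keys whose definitions changed between two
    iterations; an empty list means no drift against frozen semantics."""
    drifted = [
        key
        for key in set(prev_rules) | set(next_rules)
        if prev_rules.get(key) != next_rules.get(key)
    ]
    return sorted(drifted)

# Hypothetical example: an entry rule silently changed between iterations.
iter1 = {"entry": "close > sma(20)", "exit": "close < sma(20)", "risk": "max_dd 0.1"}
iter2 = {"entry": "close > sma(50)", "exit": "close < sma(20)", "risk": "max_dd 0.1"}
print(detect_rule_drift(iter1, iter2))  # -> ['entry']
```

A fingerprint makes drift cheap to log per iteration, while the key-level diff supplies the evidence a patch request would cite.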
📝 Abstract
Large language models (LLMs) are increasingly used as quantitative research copilots to translate natural-language strategy specifications into executable trading code. Yet most existing evaluations either focus on static financial knowledge or summarize performance with a single profitability metric, leaving a gap for benchmarking strategy-to-code trading systems as governed, auditable software. We introduce SysTradeBench (SysTB), an iterative build-test-patch benchmark that evaluates LLM-generated trading systems under drift-aware diagnostics. Given a standardized Base Strategy Doc and frozen semantics, each model must produce (i) a strategy card, (ii) executable code, and (iii) mandatory audit logs. A sandboxed harness runs determinism and anti-leakage checks, detects rule drift across iterations, and returns evidence bundles to support constrained patches. SysTradeBench reports multi-dimensional scorecards for spec fidelity, risk discipline, reliability, and out-of-sample robustness indicators, together with cost-effectiveness signals. We evaluate 17 models across 12 strategies. Top models achieve validity above 91.7 percent with strong aggregate scores, but evidence-driven iteration also induces code convergence by Iter2. These findings suggest that LLM iteration complements rather than replaces human quantitative researcher governance: LLMs excel at rapid prototyping and shallow bug fixes, while human oversight remains essential for critical strategies requiring solution diversity and ensemble robustness.
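The build-test-patch loop in the abstract (sandboxed checks, evidence bundles, constrained repair) can be sketched as follows. This is a minimal assumed skeleton, not the benchmark's harness: `EvidenceBundle`, `run_harness`, the string-based checks, and the `model_patch` callback are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceBundle:
    """Failed checks plus supporting logs returned to the model."""
    failed_checks: list = field(default_factory=list)
    logs: list = field(default_factory=list)

def run_harness(code: str) -> EvidenceBundle:
    """Toy stand-in for the sandboxed harness; real determinism and
    anti-leakage checks would execute the code here."""
    bundle = EvidenceBundle()
    if "random" in code and "seed" not in code:
        bundle.failed_checks.append("determinism")
    if "future_bar" in code:
        bundle.failed_checks.append("anti-leakage")
    return bundle

def build_test_patch(model_patch, initial_code: str, max_iters: int = 3):
    """Iterate: test in the sandbox, collect evidence, request a
    constrained patch; stop when all checks pass or iterations run out."""
    code = initial_code
    for it in range(1, max_iters + 1):
        bundle = run_harness(code)
        if not bundle.failed_checks:
            return code, it                # converged at iteration `it`
        code = model_patch(code, bundle)   # repair under frozen semantics
    return code, max_iters
```

Returning the iteration count alongside the final code mirrors the paper's observation that convergence typically occurs early (by Iter2).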
Problem

Research questions and friction points this paper is trying to address.

strategy-to-code
trading systems
benchmarking
drift-aware diagnostics
auditable software
Innovation

Methods, ideas, or system contributions that make the work stand out.

strategy-to-code
drift-aware diagnostics
iterative benchmarking
auditability
LLM evaluation
🔎 Similar Papers
No similar papers found.