🤖 AI Summary
Research on LLM-driven trading systems is hindered by the absence of comparable evaluation protocols, well-defined execution semantics, and reproducible artifacts. This work reviews 77 studies from a protocol-coded snapshot, using Architecture-Capability-Adaptation as a working analytical lens rather than a validated taxonomy, and audits them with R0–R3 reproducibility ratings and transaction-semantics coding to build an evidence ledger and a standardized reporting checklist. Only 19 of the 77 studies meet the minimum boundary of action output plus closed-loop evaluation. Within that subset, few report extractable time-consistent data splits (2/19), an explicit transaction-cost model (1/19), or universe or survivorship handling (1/19), and no study reaches R3 reproducibility. These findings underscore the need for standardized evaluation and reporting practices and offer methodological guidance for future research.
📝 Abstract
A growing body of work explores how Large Language Models (LLMs) can be embedded in trading systems as agents that perceive market information, retrieve context, reason about decisions, emit tradable actions, and adapt under market feedback. This paper reframes LLM-based trading agents as expert-system decision pipelines and presents an audit-oriented evidence map of 77 included studies in a protocol-coded snapshot screened through 2026-03-09. A primary empirical subset (n=19) satisfies the minimum boundary of Action Output plus Closed-Loop Evaluation; the remaining 58 included studies are retained as background and design context. The central empirical finding is protocol incomparability: within the primary subset, only 2/19 studies report extractable time-consistent split protocols, 1/19 reports an explicit transaction-cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and no study reaches R3 reproducibility. We therefore use Architecture-Capability-Adaptation as a working analytical lens rather than a validated taxonomy, and we foreground the evidence ledger, reproducibility audit, and reporting checklist as the main contributions. The resulting survey shows that architectural experimentation is expanding rapidly, while comparable evaluation protocols, execution semantics, and reproducible artifacts remain the field's immediate bottlenecks.
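To make the reported coverage figures easier to compare, the counts stated in the abstract can be converted to percentages of the primary subset. The following is a minimal sketch, not the authors' tooling: the dictionary keys, the `share` helper, and the variable names are illustrative assumptions, and only the counts themselves (out of n=19 and 77) come from the abstract.

```python
# Counts below are taken from the abstract; all names are illustrative assumptions.
PRIMARY_N = 19   # primary empirical subset (Action Output plus Closed-Loop Evaluation)
INCLUDED_N = 77  # all included studies in the protocol-coded snapshot

# Number of primary-subset studies reported for each audited item.
reported_counts = {
    "time-consistent split protocol": 2,
    "explicit transaction-cost model": 1,
    "universe or survivorship handling": 1,
    "execution timing or semantics": 11,
    "coded R0": 15,
    "reaches R3": 0,
}


def share(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Percentage of the primary subset for each audited item.
coverage = {name: share(c, PRIMARY_N) for name, c in reported_counts.items()}

# Studies retained only as background and design context.
background_n = INCLUDED_N - PRIMARY_N

for name, pct in coverage.items():
    print(f"{name:34s} {reported_counts[name]:2d}/{PRIMARY_N}  {pct:5.1f}%")
print(f"background studies: {background_n}")
print(f"primary subset share of included studies: {share(PRIMARY_N, INCLUDED_N)}%")
```

Read this way, roughly one study in ten reports an extractable time-consistent split, about one in twenty reports a cost model or survivorship handling, and the primary subset is about a quarter of the included studies.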