StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing financial benchmarks primarily assess static knowledge and fail to evaluate LLM agents' sustained decision-making in dynamic, iterative real-world stock trading. Method: StockBench is introduced as the first contamination-free, multi-month, daily-frequency benchmark for evaluating LLM agents on real-market trading; agents make continuous buy/sell decisions grounded in price, fundamental, and news signals, and are scored with professional financial metrics including cumulative return, maximum drawdown, and the Sortino ratio. Contribution/Results: Experiments show that most LLM agents underperform a simple buy-and-hold baseline, although state-of-the-art models exhibit higher return potential and stronger risk management. The work bridges the gap between static knowledge assessment and practical trading-capability evaluation; all code and data are open-sourced for reproducibility.

📝 Abstract
Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents in realistic stock trading environments
Assessing dynamic decision-making beyond static financial knowledge
Measuring trading performance using financial risk-return metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

StockBench benchmark for realistic stock trading evaluation
Agents use daily market signals for sequential decisions
Performance assessed via financial metrics like cumulative return
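The three evaluation metrics named above can be sketched in a few lines of Python. This is a minimal illustration of the standard definitions, not the paper's actual evaluation code; the function names and the toy equity series are invented for this example:

```python
import numpy as np

def cumulative_return(equity):
    """Total return over the period from an equity (portfolio value) series."""
    return equity[-1] / equity[0] - 1.0

def max_drawdown(equity):
    """Largest peak-to-trough decline of the equity series, as a fraction."""
    equity = np.asarray(equity, dtype=float)
    running_peak = np.maximum.accumulate(equity)   # highest value seen so far
    drawdowns = (running_peak - equity) / running_peak
    return drawdowns.max()

def sortino_ratio(returns, risk_free=0.0):
    """Mean excess return divided by downside deviation.

    Unlike the Sharpe ratio, only negative excess returns count as risk.
    """
    excess = np.asarray(returns, dtype=float) - risk_free
    downside = excess[excess < 0]
    if downside.size == 0:
        return float("inf")
    downside_dev = np.sqrt(np.mean(downside ** 2))
    return excess.mean() / downside_dev

# Toy example: a daily equity curve over five trading days
equity = [100.0, 103.0, 101.0, 104.0, 102.0]
daily_returns = np.diff(equity) / equity[:-1]
print(round(cumulative_return(equity), 4))        # → 0.02
print(round(max_drawdown(equity), 4))             # → 0.0194
print(round(sortino_ratio(daily_returns), 4))     # → 0.2724
```

In a multi-month benchmark such as StockBench, these would be computed over the agent's full daily portfolio-value trajectory; the Sortino ratio in particular rewards agents that avoid downside volatility rather than volatility in general.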
Yanxu Chen
Tsinghua University
Zijun Yao
Tsinghua University
Yantao Liu
Qwen, Alibaba
Reinforcement Learning · Reward Modeling · Large Language Models
Jin Ye
Beijing University of Posts and Telecommunications
Jianing Yu
Beijing University of Posts and Telecommunications
Lei Hou
RMIT University
Building Information Modeling (BIM) · Project Management · Construction IT · Productivity Research · Lean Construction
Juanzi Li
Tsinghua University
Semantic Web · Data Mining · NLP