When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research on LLM-based trading agents relies heavily on backtesting, small-scale datasets, and static model evaluations, and lacks a long-term, multi-asset, reproducible benchmark for real-market assessment. Method: We introduce Agent Market Arena (AMA), the first lifelong real-time evaluation benchmark for LLM trading agents. AMA integrates verified cryptocurrency and equity market data with expert-vetted financial news to continuously assess agents' financial reasoning and adaptive capabilities in live environments, and provides a multi-agent comparative framework that isolates architectural effects from the influence of the base model. Results: Experiments with state-of-the-art models (GPT-4o, Claude-3.5-haiku, and Gemini-2.0-flash), augmented with memory-enhanced reasoning and risk modeling, reveal stable, distinguishable behavioral patterns in strategy aggressiveness and risk preference. Crucially, system architecture exerts far greater influence on agent behavior than the choice of underlying LLM. AMA demonstrates strong validity and scalability for evaluating trading intelligence in realistic financial settings.

📝 Abstract
Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents: InvestorAgent as a single-agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory-based reasoning, and evaluates them across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, ranging from aggressive risk-taking to conservative decision-making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents' reasoning and adaptation in live financial markets
Addressing limited testing periods, assets, and unverified data in trading benchmarks
Establishing continuous multi-market evaluation for diverse agent architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lifelong real-time benchmark for multi-market trading agents
Integrates verified data and expert-checked news sources
Implements diverse agent architectures with memory-based reasoning
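The comparative idea behind AMA, crossing every agent framework with every backbone model so the two effects can be separated, can be illustrated with a toy sketch. Everything here is hypothetical: the framework names, the per-model bias, and the numbers are illustrative stand-ins, not the paper's actual agents or measurements.

```python
from itertools import product
from statistics import mean

# Hypothetical agent frameworks: each maps a market signal (a float) to a
# position in [-1, 1]. These stand in for architectures like TradeAgent
# (aggressive) and DeepFundAgent (conservative); the scaling factors are invented.
FRAMEWORKS = {
    "aggressive": lambda signal, bias: max(-1.0, min(1.0, 2.0 * signal + bias)),
    "conservative": lambda signal, bias: max(-1.0, min(1.0, 0.5 * signal + bias)),
}

# Hypothetical backbone "models": a small additive bias stands in for the
# behavioral contribution of the underlying LLM.
MODELS = {"model_a": 0.05, "model_b": -0.05}

def run_arena(signals):
    """Cross every framework with every backbone; record each pair's mean position."""
    results = {}
    for (fname, framework), (mname, bias) in product(FRAMEWORKS.items(), MODELS.items()):
        positions = [framework(s, bias) for s in signals]
        results[(fname, mname)] = mean(positions)
    return results

def effect_spreads(results):
    """Spread of behavior across frameworks vs. across models.

    Averaging out one factor before taking the max-min spread of the other
    gives a crude read on which factor drives more variation.
    """
    frameworks = sorted({f for f, _ in results})
    models = sorted({m for _, m in results})
    framework_means = [mean(results[(f, m)] for m in models) for f in frameworks]
    model_means = [mean(results[(f, m)] for f in frameworks) for m in models]
    spread = lambda xs: max(xs) - min(xs)
    return spread(framework_means), spread(model_means)
```

With signals `[0.1, 0.2, 0.3]`, the framework spread (0.3) dominates the model spread (0.1), mirroring the paper's headline finding that architecture shapes behavior more than the backbone LLM, though of course with made-up dynamics.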