🤖 AI Summary
A lack of unified, standardized evaluation benchmarks and datasets for financial decision-making hinders comparable and reliable assessment of LLM-based agents in multi-asset scenarios.
Method: We introduce InvestorBench, the first benchmark platform specifically designed for evaluating LLM agents in financial decision-making, covering diverse asset classes including equities, cryptocurrencies, and ETFs. It establishes a standardized evaluation framework comprising 12 core tasks (e.g., risk control, asset allocation, trend forecasting), integrates multimodal open-source data (textual, time-series, chart-based), provides reproducible simulation environments, and adopts a unified agent architecture. We instantiate agents from 13 state-of-the-art LLMs, enhanced with reinforcement learning and chain-of-thought reasoning for decision-making.
Contribution/Results: InvestorBench enables systematic, cross-model, cross-market, and cross-task evaluation. Experiments demonstrate significantly improved assessment consistency and reproducibility. All code and data are fully open-sourced.
📝 Abstract
Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities such as stocks, cryptocurrencies, and exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source, multi-modal datasets and developed a comprehensive suite of environments for financial decision-making. This establishes a highly accessible platform for evaluating financial agents' performance across various scenarios.