🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models' (LLMs') economic decision-making and resource-management capabilities, which are often overlooked in favor of semantic performance. To bridge this gap, the work proposes Market-Bench, the first multi-agent supply chain simulation framework to incorporate economic competition mechanisms, in which LLMs act as retailers operating under budget constraints. Agents participate in procurement auctions, set dynamic retail prices, and generate marketing slogans that are delivered to buyers through a role-based attention mechanism, with full transaction trajectories recorded. Evaluations across 20 open- and closed-source models employ multidimensional metrics covering economic profit, operational efficiency, and semantic quality. Results show that only a minority of models consistently achieve capital appreciation, while most, despite comparable semantic competence, hover around the break-even point, exhibiting a pronounced winner-take-most dynamic that challenges conventional LLM evaluation paradigms.
📝 Abstract
The ability of large language models (LLMs) to acquire and manage economic resources remains unclear. In this paper, we introduce \textbf{Market-Bench}, a comprehensive benchmark that evaluates LLM capabilities on economically relevant tasks through economic and trade competition. Specifically, we construct a configurable multi-agent supply chain economic model in which LLMs act as retailer agents responsible for procuring and retailing merchandise. In the \textbf{procurement} stage, LLMs bid for limited inventory in budget-constrained auctions. In the \textbf{retail} stage, LLMs set retail prices, generate marketing slogans, and present them to buyers, whose purchase decisions are mediated by a role-based attention mechanism. Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. Benchmarking 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon, \textit{i.e.}, only a small subset of LLM retailers consistently achieve capital appreciation, while many hover around the break-even point despite similar semantic matching scores. Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.
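The abstract outlines a two-stage loop: budget-constrained procurement auctions followed by retail pricing, slogan generation, and buyer purchases, with every bid, price, and sale logged. The sketch below illustrates that loop under stated assumptions; it is not the Market-Bench implementation. All class and function names are hypothetical, random stub policies stand in for LLM agents, and a simplistic lowest-price buyer replaces the role-based attention mechanism described in the paper.

```python
# Minimal sketch of a two-stage market round (procurement auction + retail),
# as described in the abstract. Names and policies are illustrative assumptions.
import random
from dataclasses import dataclass, field


@dataclass
class Retailer:
    name: str
    budget: float                 # remaining capital (budget constraint)
    inventory: int = 0
    log: list = field(default_factory=list)   # trajectory of bids, prices, sales

    def bid(self, unit_cost: float) -> float:
        # Stub procurement policy; an LLM agent would choose this bid.
        return min(self.budget, unit_cost * random.uniform(0.9, 1.3))

    def set_price_and_slogan(self, unit_cost: float) -> tuple[float, str]:
        # Stub retail policy; an LLM agent would set the price and write the slogan.
        return unit_cost * random.uniform(1.1, 1.6), f"{self.name}: best value today!"


def procurement_auction(retailers, lots: int, unit_cost: float):
    """Budget-constrained auction: the highest bids win the limited inventory lots."""
    bids = sorted(((r.bid(unit_cost), r) for r in retailers),
                  key=lambda x: x[0], reverse=True)
    for price, r in bids[:lots]:
        if price > 0 and r.budget >= price:
            r.budget -= price
            r.inventory += 1
            r.log.append(("buy", price))


def retail_round(retailers, buyers: int, unit_cost: float):
    """Buyers pick the cheapest offer; a role-based matcher could weight slogans instead."""
    offers = [(r, *r.set_price_and_slogan(unit_cost)) for r in retailers if r.inventory > 0]
    for _ in range(buyers):
        if not offers:
            break
        r, price, slogan = min(offers, key=lambda o: o[1])   # simplistic buyer choice
        r.budget += price
        r.inventory -= 1
        r.log.append(("sell", price, slogan))
        offers = [(x, p, s) for x, p, s in offers if x.inventory > 0]


if __name__ == "__main__":
    shops = [Retailer(f"shop-{i}", budget=100.0) for i in range(3)]
    for _ in range(5):                        # five simulated market rounds
        procurement_auction(shops, lots=4, unit_cost=10.0)
        retail_round(shops, buyers=6, unit_cost=10.0)
    for s in shops:
        print(s.name, round(s.budget, 2), s.inventory)   # final balance-sheet state
```

In this toy version, capital appreciation corresponds to a shop finishing with a budget above its starting 100.0; the logged trajectories are what economic, operational, and semantic metrics would be computed over.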