🤖 AI Summary
Existing LLM agent evaluation benchmarks predominantly focus on single-task settings or rely on coarse-grained metrics, failing to capture agents’ true capabilities in complex, dynamic, multi-objective strategic decision-making. To address this gap, we propose DSGBench—the first strategic decision evaluation platform designed for long-horizon, multi-objective games. It comprises six customizable task categories, each supporting adjustable difficulty and objective configurations. We introduce a novel five-dimensional fine-grained decision capability scoring framework and an automated strategy evolution tracking mechanism. Leveraging multi-game environment modeling, structured behavioral log analysis, dimension-specific scoring functions, and a standardized LLM agent interface, DSGBench enables reproducible and attributable evaluation. Experiments demonstrate that DSGBench significantly improves agent selection accuracy and enhances precise identification of strategic deficiencies. The codebase and benchmark data are publicly released.
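The five-dimensional scoring framework can be pictured as a weighted aggregation over per-dimension scores. The sketch below is purely illustrative: the dimension names, weights, and aggregation rule are assumptions, not DSGBench's actual formulation.

```python
# Hypothetical sketch of a five-dimensional capability scoring scheme.
# Dimension names and the weighted-mean aggregation are illustrative
# assumptions, not DSGBench's published scoring functions.
from dataclasses import dataclass

DIMENSIONS = ("strategic_planning", "real_time_decision",
              "social_reasoning", "team_collaboration", "adaptive_learning")

@dataclass
class DimensionScore:
    name: str
    value: float  # normalized to [0, 1]

def composite_score(scores, weights=None):
    """Weighted mean over the capability dimensions."""
    if weights is None:
        weights = {name: 1.0 for name in DIMENSIONS}  # equal weights by default
    total_w = sum(weights[s.name] for s in scores)
    return sum(weights[s.name] * s.value for s in scores) / total_w

scores = [DimensionScore(n, v)
          for n, v in zip(DIMENSIONS, [0.8, 0.6, 0.7, 0.5, 0.9])]
print(round(composite_score(scores), 3))  # equal weights -> mean = 0.7
```

A per-dimension breakdown like this is what makes deficiencies attributable: a low `team_collaboration` score points at a specific weakness rather than a single opaque win rate.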
📝 Abstract
Large Language Model (LLM) based agents have become increasingly popular for solving complex and dynamic tasks, which calls for proper evaluation systems to assess their capabilities. Nevertheless, existing benchmarks usually either focus on single-objective tasks or use overly broad assessment metrics, failing to provide a comprehensive inspection of the actual capabilities of LLM-based agents in complicated decision-making tasks. To address these issues, we introduce DSGBench, a more rigorous evaluation platform for strategic decision-making. First, it incorporates six complex strategic games, which serve as ideal testbeds because of their long-term, multi-dimensional decision-making demands and their flexibility in customizing tasks of varying difficulty or with multiple targets. Second, DSGBench employs a fine-grained scoring system that examines decision-making capabilities along five specific dimensions and combines them into a comprehensive assessment. Furthermore, DSGBench incorporates an automated decision-tracking mechanism that enables in-depth analysis of agent behaviour patterns and changes in their strategies. We demonstrate the advantages of DSGBench by applying it to several popular LLM-based agents; our results suggest that DSGBench provides valuable insights both for choosing LLM-based agents and for guiding their future development. DSGBench is available at https://github.com/DeciBrain-Group/DSGBench.
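The standardized agent interface and decision-tracking mechanism described above can be sketched as a minimal observe–act loop that logs every decision for later behavioural analysis. All class and method names here are hypothetical stand-ins, not DSGBench's actual API.

```python
# Illustrative sketch of a standardized agent interface with decision logging.
# GameAgent, act(), and run_episode() are assumed names for this example,
# not DSGBench's real interface.
from abc import ABC, abstractmethod

class GameAgent(ABC):
    @abstractmethod
    def act(self, observation: str) -> str:
        """Map a textual game observation to an action string."""

class EchoAgent(GameAgent):
    """Trivial stand-in for an LLM-backed agent."""
    def act(self, observation: str) -> str:
        return f"noop after {observation}"

def run_episode(agent: GameAgent, observations):
    """Drive the agent through one episode, recording each
    (observation, action) pair so strategy changes can be analysed."""
    log = []
    for obs in observations:
        log.append((obs, agent.act(obs)))
    return log

log = run_episode(EchoAgent(), ["turn 1", "turn 2"])
print(len(log))  # 2
```

Keeping the interface uniform is what makes results comparable across the six games: any agent implementing `act` can be dropped into any environment, and the structured log feeds the downstream scoring and tracking.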