🤖 AI Summary
Existing LLM agent evaluation benchmarks predominantly focus on single-task settings or rely on coarse-grained metrics, failing to capture agents’ true capabilities in complex, dynamic, multi-objective strategic decision-making. To address this gap, we propose DSGBench—the first strategic decision evaluation platform designed for long-horizon, multi-objective games. It comprises six customizable task categories, each supporting adjustable difficulty and objective configurations. We introduce a novel five-dimensional fine-grained decision capability scoring framework and an automated strategy evolution tracking mechanism. Leveraging multi-game environment modeling, structured behavioral log analysis, dimension-specific scoring functions, and a standardized LLM agent interface, DSGBench enables reproducible and attributable evaluation. Experiments demonstrate that DSGBench significantly improves agent selection accuracy and enhances precise identification of strategic deficiencies. The codebase and benchmark data are publicly released.
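The five-dimensional scoring framework can be pictured as a weighted aggregation over per-dimension scores. The sketch below is purely illustrative: the dimension names, weights, and aggregation rule are assumptions, not DSGBench's actual formulation.

```python
# Hypothetical sketch of a five-dimensional capability scoring scheme.
# Dimension names and the weighted-mean aggregation are illustrative
# assumptions, not DSGBench's published scoring functions.
from dataclasses import dataclass

DIMENSIONS = ("strategic_planning", "real_time_decision",
              "social_reasoning", "team_collaboration", "adaptive_learning")

@dataclass
class DimensionScore:
    name: str
    value: float  # normalized to [0, 1]

def composite_score(scores, weights=None):
    """Weighted mean over the capability dimensions."""
    if weights is None:
        weights = {name: 1.0 for name in DIMENSIONS}  # equal weights by default
    total_w = sum(weights[s.name] for s in scores)
    return sum(weights[s.name] * s.value for s in scores) / total_w

scores = [DimensionScore(n, v)
          for n, v in zip(DIMENSIONS, [0.8, 0.6, 0.7, 0.5, 0.9])]
print(round(composite_score(scores), 3))  # equal weights -> mean = 0.7
```

A per-dimension breakdown like this is what makes deficiencies attributable: a low `team_collaboration` score points at a specific weakness rather than a single opaque win rate.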
📝 Abstract
Large Language Model (LLM) based agents have become increasingly popular for solving complex and dynamic tasks, which calls for proper evaluation systems to assess their capabilities. Nevertheless, existing benchmarks usually either focus on single-objective tasks or use overly broad assessment metrics, failing to provide a comprehensive inspection of the actual capabilities of LLM-based agents in complicated decision-making tasks. To address these issues, we introduce DSGBench, a more rigorous evaluation platform for strategic decision-making. First, it incorporates six complex strategic games, which serve as ideal testbeds because of their long-term, multi-dimensional decision-making demands and their flexibility in customizing tasks of varying difficulty or with multiple targets. Second, DSGBench employs a fine-grained scoring system that examines decision-making capabilities along five specific dimensions and combines them into a comprehensive assessment. Furthermore, DSGBench incorporates an automated decision-tracking mechanism that enables in-depth analysis of agent behaviour patterns and changes in their strategies. We demonstrate the advantages of DSGBench by applying it to several popular LLM-based agents; our results suggest that DSGBench provides valuable insights both for choosing LLM-based agents and for guiding their future development. DSGBench is available at https://github.com/DeciBrain-Group/DSGBench.
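The standardized agent interface and decision-tracking mechanism described above can be sketched as a minimal observe–act loop that logs every decision for later behavioural analysis. All class and method names here are hypothetical stand-ins, not DSGBench's actual API.

```python
# Illustrative sketch of a standardized agent interface with decision logging.
# GameAgent, act(), and run_episode() are assumed names for this example,
# not DSGBench's real interface.
from abc import ABC, abstractmethod

class GameAgent(ABC):
    @abstractmethod
    def act(self, observation: str) -> str:
        """Map a textual game observation to an action string."""

class EchoAgent(GameAgent):
    """Trivial stand-in for an LLM-backed agent."""
    def act(self, observation: str) -> str:
        return f"noop after {observation}"

def run_episode(agent: GameAgent, observations):
    """Drive the agent through one episode, recording each
    (observation, action) pair so strategy changes can be analysed."""
    log = []
    for obs in observations:
        log.append((obs, agent.act(obs)))
    return log

log = run_episode(EchoAgent(), ["turn 1", "turn 2"])
print(len(log))  # 2
```

Keeping the interface uniform is what makes results comparable across the six games: any agent implementing `act` can be dropped into any environment, and the structured log feeds the downstream scoring and tracking.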