SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks for *StarCraft II* suffer from incomplete race coverage, coarse-grained action spaces, and weak observational representations, limiting their ability to assess realistic, complex decision-making. To address these limitations, we propose SC2Arena, the first open-source benchmark supporting all three races, native low-level action spaces, and optimized textual observations. We further introduce StarEvolve, a self-evolving framework integrating a closed planning-execution-verification loop, high-quality sample scoring, and a hierarchical decision architecture enhanced by text-augmented spatial reasoning and iterative self-correction. Experiments demonstrate that StarEvolve significantly outperforms prior methods in strategic planning capability. SC2Arena establishes a new paradigm for evaluating and advancing generalist agents in complex, dynamic environments. All code, environments, and algorithms are publicly released.

📝 Abstract
Evaluating large language models (LLMs) in complex decision-making is essential for advancing AI's capacity for strategic planning and real-time adaptation. However, existing benchmarks for tasks like StarCraft II fail to capture the game's full complexity, such as its complete game context, diverse action spaces, and all playable races. To address this gap, we present SC2Arena, a benchmark that fully supports all playable races and low-level action spaces, and that optimizes text-based observations to tackle spatial reasoning challenges. Complementing this, we introduce StarEvolve, a hierarchical framework that integrates strategic planning with tactical execution, featuring iterative self-correction and continuous improvement via fine-tuning on high-quality gameplay data. Its key components include a Planner-Executor-Verifier structure to break down gameplay, and a scoring system for selecting high-quality training samples. Comprehensive analysis using SC2Arena provides valuable insights into developing generalist agents that were not possible with previous benchmarks. Experimental results also demonstrate that our proposed StarEvolve achieves superior performance in strategic planning. Our code, environment, and algorithms are publicly available.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack full StarCraft II complexity coverage
Need for improved LLM evaluation in strategic decision-making tasks
Requirement for hierarchical self-improvement frameworks in gameplay
Innovation

Methods, ideas, or system contributions that make the work stand out.

SC2Arena benchmark supports all playable races, low-level action spaces, and optimized text-based observations
StarEvolve integrates strategic planning with tactical execution
Planner-Executor-Verifier structure enables iterative self-correction
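
The Planner-Executor-Verifier loop and the sample-scoring step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`run_step`, `select_samples`), the `Step` record, the toy planner, executor, and verifier, and the scoring threshold are all hypothetical stand-ins chosen only to show the control flow of plan, execute, verify, retry with feedback, and then keep high-scoring samples for fine-tuning.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Step:
    plan: str            # high-level strategic decision
    actions: List[str]   # low-level actions derived from the plan
    accepted: bool       # whether the verifier approved the actions

def run_step(obs: Dict, planner: Callable, executor: Callable,
             verifier: Callable, max_retries: int = 2) -> Step:
    """Plan, execute, verify; re-plan with verifier feedback on rejection."""
    feedback, plan, actions = "", "", []
    for _ in range(max_retries + 1):
        plan = planner(obs, feedback)
        actions = executor(obs, plan)
        ok, feedback = verifier(obs, plan, actions)
        if ok:
            return Step(plan, actions, True)
    return Step(plan, actions, False)

def select_samples(steps: List[Step], score: Callable[[Step], float],
                   threshold: float) -> List[Step]:
    """Keep only accepted steps whose score reaches the threshold."""
    return [s for s in steps if s.accepted and score(s) >= threshold]

# --- Toy stand-ins (hypothetical; a real system would call an LLM here) ---
def toy_planner(obs: Dict, feedback: str) -> str:
    return "build supply depot" if "supply" in feedback else "train worker"

def toy_executor(obs: Dict, plan: str) -> List[str]:
    return [plan.upper().replace(" ", "_")]

def toy_verifier(obs: Dict, plan: str, actions: List[str]) -> Tuple[bool, str]:
    if "TRAIN_WORKER" in actions and obs["supply_left"] == 0:
        return False, "supply blocked"
    return True, ""

obs = {"minerals": 100, "supply_left": 0}
step = run_step(obs, toy_planner, toy_executor, toy_verifier)
kept = select_samples([step], score=lambda s: 1.0, threshold=0.5)
print(step.plan, step.actions, len(kept))
```

In this toy run the first plan is rejected because no supply is left, the verifier's feedback is passed back to the planner, and the corrected plan is accepted and retained as a training sample.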
Pengbo Shen
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Yaqing Wang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Ni Mu
Beijing Key Laboratory of Embodied Intelligence Systems, Department of Automation, Tsinghua University
Yao Luan
Beijing Key Laboratory of Embodied Intelligence Systems, Department of Automation, Tsinghua University
Runpeng Xie
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Senhao Yang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Lexiang Wang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Hao Hu
Moonshot AI, Beijing, China
Shuang Xu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Yiqin Yang
Assistant Professor, Institute of Automation, Chinese Academy of Sciences
Reinforcement Learning, Embodied Intelligence
Bo Xu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences