Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments

📅 2026-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation methods struggle to assess the decision-making and execution capabilities of large language models as interactive agents in adversarial, time-sensitive environments. To address this gap, this work proposes STAR, a multi-agent evaluation framework that models reasoning as an iterative, adaptive decision process within 1v1 zero-sum games. STAR supports both turn-based and real-time modes, enabling unified assessment of long-term strategic planning and rapid tactical execution. It is the first framework to jointly incorporate strategic depth and execution timeliness, introducing metrics such as execution efficiency and outcome stability to reveal the strategy-execution gap. The framework provides a modular, reproducible, and standardized benchmarking platform. Experiments show that reasoning-intensive models excel in turn-based settings but underperform in real-time scenarios due to latency, where lightweight instruction-tuned models prevail instead, highlighting the necessity of balancing reasoning depth with action timeliness in interactive intelligence.

📝 Abstract
Large Language Models (LLMs) have achieved strong performance on static reasoning benchmarks, yet their effectiveness as interactive agents operating in adversarial, time-sensitive environments remains poorly understood. Existing evaluations largely treat reasoning as a single-shot capability, overlooking the challenges of opponent-aware decision-making, temporal constraints, and execution under pressure. This paper introduces Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework that assesses LLMs through 1v1 zero-sum competitive interactions, framing reasoning as an iterative, adaptive decision-making process. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning and fast-paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To move beyond binary win-loss outcomes, we introduce a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. Extensive pairwise evaluations reveal a pronounced strategy-execution gap: while reasoning-intensive models dominate turn-based settings, their inference latency often leads to inferior performance in real-time scenarios, where faster instruction-tuned models prevail. These results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade-off in competitive, dynamic settings.
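STAR's actual API is not shown on this page, so the following is a hedged illustration only: a minimal Python sketch of how a unified turn-based/real-time 1v1 evaluation loop with a per-move time budget might look. `RaceJudge`, `play_match`, `MatchResult`, and the agent lambdas are hypothetical stand-ins, not the framework's real interface.

```python
import time
from dataclasses import dataclass

@dataclass
class MatchResult:
    winner: str      # "A", "B", or "draw"
    timeouts: int    # moves discarded for exceeding the time budget

class RaceJudge:
    """Toy 1v1 zero-sum game: the first player to reach `target` points wins."""
    def __init__(self, target=5):
        self.target = target
    def initial_state(self):
        return {"A": 0, "B": 0}
    def apply(self, state, player, points):
        new = dict(state)
        new[player] += points
        return new
    def winner(self, state):
        return next((p for p in ("A", "B") if state[p] >= self.target), None)

def play_match(agent_a, agent_b, judge, time_budget=None, max_turns=50):
    """Alternate moves between two agents.

    time_budget=None  -> turn-based mode: agents may deliberate freely.
    time_budget=t     -> real-time mode: a move slower than t seconds is
                         discarded (the agent effectively passes), one simple
                         way a latency penalty can surface the paper's
                         strategy-execution gap."""
    state, timeouts = judge.initial_state(), 0
    roster = [("A", agent_a), ("B", agent_b)]
    for turn in range(max_turns):
        player, agent = roster[turn % 2]
        start = time.perf_counter()
        move = agent(state)
        if time_budget is not None and time.perf_counter() - start > time_budget:
            timeouts += 1
            continue  # the move arrived too late to count
        state = judge.apply(state, player, move)
        if judge.winner(state):
            return MatchResult(judge.winner(state), timeouts)
    return MatchResult("draw", timeouts)

# A "deliberate" agent scores more per move but thinks for 50 ms;
# a "reactive" agent scores less but answers instantly.
fast = lambda state: 1
slow = lambda state: (time.sleep(0.05), 2)[1]
```

With no budget, the slower but higher-scoring agent wins; with a tight budget its moves time out and the fast agent prevails, mirroring the reversal between turn-based and real-time settings that the abstract reports.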
Problem

Research questions and friction points this paper is trying to address.

strategic reasoning
rapid decision-making
zero-sum environments
interactive agents
time-sensitive
Innovation

Methods, ideas, or system contributions that make the work stand out.

Strategic Reasoning
Real-time Decision-Making
Multi-agent Evaluation
Zero-sum Games
LLM Benchmarking
Yang Li
Tsinghua Shenzhen International Graduate School
transfer learning, trustworthy AI, representation learning, spatial algorithms
Xing Chen
Yutao Liu
Ocean University of China (OUC)
Gege Qi
CAICT
Yanxian BI
CAEIT
Zizhe Wang
Beihang University
Yunjian Zhang
UCAS
Yao Zhu
Zhejiang University
Robust machine learning