🤖 AI Summary
Existing Q&A benchmarks rely heavily on static data and inadequately assess LLMs’ strategic reasoning and psychological adaptability in dynamic, adversarial settings.
Method: This work introduces a multi-agent adversarial evaluation framework grounded in five board games, implemented on the custom platform Qi Town, where 20 LLM agents engage in round-robin gameplay. We propose the Performance Loop Graph (PLG) to quantify strategic stability, introduce the Positive Sentiment Score (PSS) to measure emotional resilience, and combine Elo ratings with fine-grained sentiment analysis.
Contribution/Results: Experiments reveal that while most LLMs sustain positive affect and strong adaptability under high-pressure competition, the PLG uncovers prevalent win-rate cycles, indicating intrinsic instability in strategic performance. To our knowledge, this is the first systematic evaluation of LLM adversarial intelligence along both *strategic* and *psychological* dimensions, establishing a novel paradigm for benchmarking beyond conventional question-answering tasks.
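The summary pairs Elo ratings with the PLG. The paper does not state its Elo parameters, so the following is a minimal sketch of the standard logistic Elo update, assuming a K-factor of 32 and a 400-point scale (both are conventional defaults, not values from the paper):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update for one game between players A and B.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    K-factor and 400-point scale are conventional assumptions,
    not parameters reported by the paper.
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    # Zero-sum update: rating points gained by A are lost by B.
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

For two equally rated players (expected score 0.5), a win transfers exactly `k / 2` points from the loser to the winner, so repeated round-robin play converges ratings toward relative strength.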
📝 Abstract
Adversarial board games, as a paradigmatic domain of strategic reasoning and intelligence, have long served as both a popular competitive activity and a benchmark for evaluating artificial intelligence (AI) systems. Building on this foundation, we propose an adversarial benchmarking framework that assesses the comprehensive performance of Large Language Models (LLMs) through board game competition, compensating for the data-dependency limitations of mainstream Question-and-Answer (Q&A) benchmarks. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players. The platform employs both the Elo rating system and a novel Performance Loop Graph (PLG) to quantitatively evaluate the technical capabilities of LLMs, while also capturing a Positive Sentiment Score (PSS) throughout gameplay to assess psychological fitness. The evaluation is structured as a round-robin tournament, enabling systematic comparison across players. Experimental results indicate that, despite technical differences, most LLMs remain optimistic in the face of both wins and losses, demonstrating greater adaptability to high-stress adversarial environments than humans. On the other hand, the cyclic win-loss relationships exposed by PLGs reveal the instability of LLMs' strategic play during games, warranting further explanation and exploration.
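The abstract attributes strategic instability to cyclic win-loss relationships in the PLG. The paper's exact PLG construction is not given here, but one plausible reading is a directed graph whose edge u → v means player u beats player v more than half the time; a cycle in that graph (A beats B, B beats C, C beats A) means no consistent strength ordering exists. A minimal sketch of detecting such non-transitivity, under that assumed representation:

```python
def has_win_cycle(beats):
    """Detect a cycle in an assumed 'beats' relation.

    beats: dict mapping each player to the set of players it beats
    (win rate > 0.5). This representation is an illustrative
    assumption, not the paper's actual PLG definition.
    Returns True if the relation is non-transitive (contains a cycle).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on stack / done
    color = {p: WHITE for p in beats}

    def dfs(u):
        color[u] = GRAY
        for v in beats.get(u, ()):
            if color.get(v, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(v, WHITE) == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False

    return any(color[p] == WHITE and dfs(p) for p in beats)
```

On a rock-paper-scissors-style triple this returns True, while a strictly ordered field (every stronger player beats every weaker one) yields False, matching the abstract's contrast between stable and cyclic performance.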