CHBench: A Cognitive Hierarchy Benchmark for Evaluating Strategic Reasoning Capability of LLMs

📅 2025-08-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation metrics for strategic reasoning in large language models (LLMs) rely on utility-based measures, which lack robustness against variations in opponent strategies and game structures. Method: We propose a novel evaluation framework grounded in Cognitive Hierarchy Theory (CHT), systematically assessing LLMs' reasoning consistency across 15 canonical normal-form games under diverse opponents and structural configurations. Our experiments cover six state-of-the-art LLMs and include fine-grained behavioral analysis. Contribution/Results: We provide the first empirical evidence that chat-based interaction degrades, while memory-augmented prompting enhances, strategic reasoning depth, quantified as cognitive hierarchy level. The framework demonstrates strong robustness and cross-game generalizability, enabling stable identification of latent cognitive hierarchies. It establishes a new, interpretable paradigm for evaluating LLMs' strategic reasoning capabilities, moving beyond utility-centric assessments toward cognitively grounded, behaviorally validated metrics.
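For readers unfamiliar with Cognitive Hierarchy Theory, the sketch below shows the standard Poisson-CH model (Camerer, Ho, and Chong, 2004) on which such frameworks are typically built: level-0 players randomize uniformly, and each level-k player best responds to a truncated Poisson belief over the lower levels. The function name, parameters, and example payoff matrix are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from math import exp, factorial

def poisson_ch_strategies(payoff, tau=1.5, max_level=4):
    """Poisson Cognitive Hierarchy model for a symmetric 2-player
    normal-form game. payoff[i, j] is the row player's payoff for
    playing action i against action j.

    Returns a list whose k-th entry is the mixed strategy of a
    level-k player: level-0 randomizes uniformly, and level-k best
    responds to a renormalized Poisson(tau) belief over levels < k.
    """
    n = payoff.shape[0]
    pmf = [tau**k * exp(-tau) / factorial(k) for k in range(max_level + 1)]
    strategies = [np.full(n, 1.0 / n)]  # level-0: uniform random play
    for k in range(1, max_level + 1):
        beliefs = np.array(pmf[:k]) / sum(pmf[:k])   # belief over lower levels
        opponent = sum(b * s for b, s in zip(beliefs, strategies))
        expected = payoff @ opponent                 # expected payoff per action
        best = np.isclose(expected, expected.max())  # ties split uniformly
        strategies.append(best / best.sum())
    return strategies

# Illustrative 3-action payoff matrix (not from the paper).
if __name__ == "__main__":
    payoff = np.array([[3.0, 0.0, 0.0],
                       [4.0, 1.0, 0.0],
                       [0.0, 4.0, 2.0]])
    for k, s in enumerate(poisson_ch_strategies(payoff)):
        print(f"level-{k}: {np.round(s, 3)}")
```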

📝 Abstract
Game-playing ability serves as an indicator of the strategic reasoning capability of large language models (LLMs). However, most existing studies rely on utility performance metrics, which are not robust to variations in opponent behavior and game structure. To address this limitation, we propose the Cognitive Hierarchy Benchmark (CHBench), a novel evaluation framework inspired by cognitive hierarchy models from behavioral economics. We hypothesize that agents have bounded rationality: different agents behave at varying reasoning depths (levels). We evaluate LLMs' strategic reasoning through a three-phase systematic framework, utilizing behavioral data from six state-of-the-art LLMs across fifteen carefully selected normal-form games. Experiments show that LLMs exhibit consistent strategic reasoning levels across diverse opponents, confirming the framework's robustness and generalization capability. We also analyze the effects of two key mechanisms (the Chat Mechanism and the Memory Mechanism) on strategic reasoning performance. Results indicate that the Chat Mechanism significantly degrades strategic reasoning, whereas the Memory Mechanism enhances it. These insights position CHBench as a promising tool for evaluating LLM capabilities, with significant potential for future research and practical applications.
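The abstract frames evaluation as assigning each LLM a reasoning level from its observed play. Below is a minimal sketch of one way such a level could be fit, assuming the per-level strategies come from a CH model like the one above and assuming a simple maximum-likelihood comparison over observed actions; the estimation procedure shown here is an assumption for illustration, not the paper's actual three-phase method.

```python
import numpy as np

def estimate_level(level_strategies, observed_actions, eps=1e-9):
    """Maximum-likelihood estimate of an agent's reasoning level.

    level_strategies[k] is the mixed strategy predicted for a level-k
    player (e.g., from a CH model); observed_actions is the sequence of
    action indices the agent actually chose. Probabilities are clipped
    at eps so actions outside a predicted support are penalized rather
    than producing -inf. Returns (best_level, per-level log-likelihoods).
    """
    log_liks = []
    for strategy in level_strategies:
        probs = np.clip(np.asarray(strategy), eps, 1.0)
        log_liks.append(float(sum(np.log(probs[a]) for a in observed_actions)))
    return int(np.argmax(log_liks)), log_liks

# Toy predictions over 3 actions and 6 observed choices (made up).
if __name__ == "__main__":
    preds = [np.array([1/3, 1/3, 1/3]),   # level-0
             np.array([0.1, 0.8, 0.1]),   # level-1
             np.array([0.1, 0.1, 0.8])]   # level-2
    level, _ = estimate_level(preds, [2, 2, 1, 2, 2, 2])
    print("estimated level:", level)      # prints 2
```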
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' strategic reasoning via game-playing ability
Addressing robustness issues in existing utility performance metrics
Assessing impact of chat and memory mechanisms on reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CHBench for LLM strategic evaluation
Uses cognitive hierarchy models from economics
Tests chat and memory mechanisms' effects (see the prompt sketch below)
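As a concrete (hypothetical) illustration of the Memory Mechanism referenced above, the sketch below feeds summaries of past rounds back into the per-round prompt of a repeated normal-form game; the prompt wording and function signature are invented for illustration and do not come from the paper.

```python
def build_prompt(game_description, history, use_memory=True):
    """Assemble a per-round prompt for a repeated normal-form game.

    history is a list of (my_action, opp_action, my_payoff) tuples from
    earlier rounds. With use_memory=True, those rounds are summarized
    into the prompt (a memory-style mechanism); otherwise each round is
    played from scratch. Wording is illustrative, not from the paper.
    """
    lines = [game_description]
    if use_memory and history:
        lines.append("Previous rounds (your action, opponent's action, your payoff):")
        lines += [f"  round {i + 1}: {m}, {o}, {p}"
                  for i, (m, o, p) in enumerate(history)]
    lines.append("Reply with the index of exactly one action.")
    return "\n".join(lines)
```

Comparing an agent's estimated level with and without such history would then quantify the mechanism's effect, in the spirit of the paper's Memory Mechanism analysis.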
Hongtao Liu
Du Xiaoman Financial
LLM · Recommender System
Zhicheng Du
Gaoling School of Artificial Intelligence, Renmin University of China
Zihe Wang
Gaoling School of Artificial Intelligence, Renmin University of China
Weiran Shen
Renmin University of China
Game Theory · Auction · Mechanism Design · Multi-agent System · Machine Learning