Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies

📅 2025-10-13
🤖 AI Summary
Problem: Prior work reduces social deduction games (e.g., Werewolf) to large language model (LLM) self-play, which yields templated outputs, overlooks social complexity, and relies on coarse-grained evaluation metrics (e.g., survival time) for lack of high-quality reference data. Method: We curate a human-verified multimodal Werewolf dataset (over 100 hours of video, 32.4M utterance tokens, 15 rule variants) and propose a two-stage, human-strategy-aligned evaluation framework that uses the winning faction's strategies as ground truth to assess linguistic stance (via multiple-choice tasks) and decision-making (via voting behavior and role identification), enabling fine-grained, quantifiable evaluation. Contribution/Results: Experiments reveal significant performance divergence among state-of-the-art LLMs: roughly half score below 0.50, exposing clear gaps in deception and counterfactual reasoning.

📝 Abstract
Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model's voting choices and opponent-role inferences. This framework enables a fine-grained evaluation of models' linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs perform unevenly, with roughly half remaining below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi-agent interaction.
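As a rough illustration of the two-stage evaluation described in the abstract, the sketch below scores speech items as multiple-choice accuracy per dimension and decision items as vote and role-inference accuracy against winning-faction references. The abstract does not specify the paper's data schema, dimension names, or exact metrics, so every class, field, and function name here (`SpeechItem`, `DecisionItem`, `speech_accuracy`, `decision_scores`) and the use of plain accuracy are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SpeechItem:
    """One multiple-choice stance question (hypothetical schema)."""
    dimension: str   # placeholder label; the abstract does not name the five dimensions
    reference: str   # option matching the winning-faction player's stance
    predicted: str   # option chosen by the model


@dataclass
class DecisionItem:
    """One decision point in a game (hypothetical schema)."""
    reference_vote: str    # player the winning-faction human voted for
    predicted_vote: str    # player the model voted for
    reference_roles: dict  # opponent player -> true role
    predicted_roles: dict  # opponent player -> role inferred by the model


def speech_accuracy(items):
    """Return accuracy per dimension (assumed metric: exact option match)."""
    totals, hits = {}, {}
    for item in items:
        totals[item.dimension] = totals.get(item.dimension, 0) + 1
        hits[item.dimension] = hits.get(item.dimension, 0) + int(
            item.predicted == item.reference
        )
    return {dim: hits[dim] / totals[dim] for dim in totals}


def decision_scores(items):
    """Return vote accuracy and role-inference accuracy (assumed metrics)."""
    vote_hits = sum(item.predicted_vote == item.reference_vote for item in items)
    role_total = sum(len(item.reference_roles) for item in items)
    role_hits = sum(
        item.predicted_roles.get(player) == role
        for item in items
        for player, role in item.reference_roles.items()
    )
    return {
        "vote_accuracy": vote_hits / len(items) if items else 0.0,
        "role_accuracy": role_hits / role_total if role_total else 0.0,
    }


# Toy data, invented purely to exercise the functions.
speech = [
    SpeechItem("dim_a", "A", "A"),
    SpeechItem("dim_a", "B", "C"),
    SpeechItem("dim_b", "A", "A"),
]
decisions = [
    DecisionItem("P3", "P3",
                 {"P1": "werewolf", "P2": "villager"},
                 {"P1": "werewolf", "P2": "seer"}),
    DecisionItem("P5", "P4", {"P1": "werewolf"}, {}),
]
print(speech_accuracy(speech))
print(decision_scores(decisions))
```

Keeping the two stages as separate score dictionaries, rather than collapsing them into one number, mirrors the abstract's emphasis on fine-grained evaluation; how the paper aggregates scores into the reported 0.50 threshold is not stated here.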
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' social intelligence in deduction games beyond survival metrics
Assessing strategic alignment through speech and decision evaluation frameworks
Addressing gaps in deception and counterfactual reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-verified multimodal dataset for social deduction
Strategy-alignment evaluation using winning faction strategies
Fine-grained assessment of linguistic and reasoning capabilities
Authors
Zirui Song
PhD student at MBZUAI
NLP
Yuan Huang
Northeastern University
Junchang Liu
Northeastern University
Haozhe Luo
University of Bern (ARTORG)
Medical Image Analysis, Computer Vision
Chenxi Wang
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Lang Gao
MBZUAI
Mechanistic Interpretability, Natural Language Processing
Zixiang Xu
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Mingfei Han
MBZUAI; University of Technology Sydney; Bytedance Seed; MMLab, SIAT
Object Recognition, Video Understanding, Vision Language Models, Robotics
Xiaojun Chang
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Xiuying Chen
MBZUAI
Trustworthy NLP, Human-Centered NLP, Computational Social Science