Microscopic Analysis on LLM players via Social Deduction Game

📅 2024-08-19

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing evaluations of large language models (LLMs) in social deduction games (SDGs) suffer from two key limitations: (1) coarse-grained metrics that rely on game-level win/loss outcomes and ignore event-level behavior, and (2) the absence of structured error analysis. To address these, we introduce SpyGame, a variant of SpyFall, and use it to analyze the gameplay of four LLMs with eight event-level quantitative metrics and a qualitative thematic analysis. The metrics target two critical skills, intent identification and camouflage, and our experiments indicate that they are more effective than existing metrics for evaluating those skills. The thematic analysis identifies four major categories that affect LLM gameplay, and these categories complement and support the quantitative findings.

📝 Abstract

Recent studies have begun developing autonomous game players for social deduction games using large language models (LLMs). When building LLM players, fine-grained evaluations are crucial for addressing weaknesses in game-playing abilities. However, existing studies have often overlooked such assessments. Specifically, we point out two issues with the evaluation methods employed. First, game-playing abilities have typically been assessed through game-level outcomes rather than specific event-level skills; second, error analyses have lacked structured methodologies. To address these issues, we propose an approach utilizing a variant of the SpyFall game, named SpyGame. We conducted an experiment with four LLMs, analyzing their gameplay behavior in SpyGame both quantitatively and qualitatively. For the quantitative analysis, we introduced eight metrics to resolve the first issue, revealing that these metrics are more effective than existing ones for evaluating the two critical skills: intent identification and camouflage. In the qualitative analysis, we performed thematic analysis to resolve the second issue. This analysis identified four major categories that affect the gameplay of LLMs. Additionally, we demonstrate how these categories complement and support the findings from the quantitative analysis.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in social deduction games lacks fine-grained metrics

Existing error analyses lack structured methodologies for meaningful insights

Assessing LLMs' reasoning failures in obscured communication is inadequate

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces eight fine-grained, event-level evaluation metrics

Conducts thematic analysis of reasoning failures

Systematic approach for obscured communication evaluation
