🤖 AI Summary
Existing evaluations of large language models (LLMs) in social deduction games (SDGs) have two key limitations: (1) coarse-grained metrics that rely on game-level win/loss outcomes and ignore event-level behavior, and (2) the absence of structured error analysis. To address these, we introduce SpyGame, a variant of Spyfall, together with eight event-level quantitative metrics and a thematic analysis of gameplay errors. In an experiment with four LLMs, the eight metrics prove more effective than existing game-level measures for evaluating two critical skills: intent identification and camouflage. The thematic analysis identifies four major categories that affect LLM gameplay, and these qualitative categories complement and support the quantitative findings. Together, the two analyses provide a fine-grained, interpretable way to evaluate LLMs' social deduction play and to attribute their errors.
📝 Abstract
Recent studies have begun developing autonomous players for social deduction games using large language models (LLMs). When building LLM players, fine-grained evaluation is crucial for identifying and addressing weaknesses in game-playing ability. However, existing studies have often overlooked such assessments. Specifically, we point out two issues with the evaluation methods employed. First, game-playing abilities have typically been assessed through game-level outcomes rather than specific event-level skills. Second, error analyses have lacked structured methodologies. To address these issues, we propose an approach that uses a variant of the Spyfall game, named SpyGame. We conducted an experiment with four LLMs, analyzing their gameplay behavior in SpyGame both quantitatively and qualitatively. For the quantitative analysis, we introduced eight metrics to resolve the first issue and found that these metrics are more effective than existing ones for evaluating two critical skills: intent identification and camouflage. For the qualitative analysis, we performed a thematic analysis to resolve the second issue, which identified four major categories that affect LLM gameplay. Additionally, we demonstrate how these categories complement and support the findings from the quantitative analysis.