QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

2026-05-26
Citations: 0
Influential: 0
AI Summary
This work addresses the limitations of prevailing evaluation paradigms for social reasoning agents, which rely predominantly on game outcomes and fail to detect fine-grained misalignments between language and perception, such as spatial hallucinations or unfounded accusations. To bridge this gap, we propose an automated auditing framework for multimodal social agents that evaluates linguistic grounding through a three-level assessment covering game results, behavioral trajectories, and statement consistency. Using game engine logs, our approach reconstructs ground-truth agent trajectories and implements a claim-level verification pipeline powered by vision-language models. The framework supports evaluation in both homogeneous and heterogeneous (cross-model) adversarial settings. Experimental results reveal that even the strongest model hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence, exposing a significant disconnect between the agents' reasoning processes and their linguistic outputs.
๐Ÿ“ Abstract
Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.
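The idea of checking claims against a trajectory reconstructed from engine logs can be illustrated with a minimal sketch. This is not the QUACK implementation: the `LogEvent` and `SpatialClaim` types, the room names, and every function below are hypothetical, and the sketch assumes that claims have already been extracted from discussion text into structured form (a step the summary attributes to vision-language models). It covers only the spatial-hallucination check, not the other three failure modes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical engine log record: an agent entered a room at a tick."""
    tick: int
    agent: str
    room: str


@dataclass(frozen=True)
class SpatialClaim:
    """Hypothetical structured claim: `speaker` says `subject` was in `room` at `tick`."""
    speaker: str
    subject: str
    room: str
    tick: int


def reconstruct_trajectories(events):
    """Group log events into a per-agent mapping of tick -> room."""
    trajectories = {}
    for event in sorted(events, key=lambda e: e.tick):
        trajectories.setdefault(event.agent, {})[event.tick] = event.room
    return trajectories


def room_at(trajectory, tick):
    """Return the most recently logged room at or before `tick`, or None."""
    past_ticks = [t for t in trajectory if t <= tick]
    return trajectory[max(past_ticks)] if past_ticks else None


def verify_claim(claim, trajectories):
    """Label a claim as 'supported', 'hallucinated', or 'unverifiable'."""
    trajectory = trajectories.get(claim.subject)
    if trajectory is None:
        return "unverifiable"  # no log data for the agent the claim is about
    actual_room = room_at(trajectory, claim.tick)
    if actual_room is None:
        return "unverifiable"  # claim refers to a time before any logged position
    return "supported" if actual_room == claim.room else "hallucinated"


def hallucination_rate(claims, trajectories):
    """Fraction of verifiable claims that contradict the logged trajectory."""
    labels = [verify_claim(c, trajectories) for c in claims]
    verifiable = [label for label in labels if label != "unverifiable"]
    if not verifiable:
        return 0.0
    return verifiable.count("hallucinated") / len(verifiable)


# Toy data, invented for illustration only.
events = [
    LogEvent(0, "red", "cafeteria"),
    LogEvent(5, "red", "engine"),
    LogEvent(0, "blue", "storage"),
]
claims = [
    SpatialClaim("red", "red", "engine", 6),     # matches the log
    SpatialClaim("red", "red", "cafeteria", 6),  # contradicts the log
    SpatialClaim("blue", "green", "storage", 3), # no log data for "green"
]
trajectories = reconstruct_trajectories(events)
labels = [verify_claim(c, trajectories) for c in claims]
rate = hallucination_rate(claims, trajectories)
```

Restricting the denominator to verifiable claims mirrors the abstract's phrasing ("15.1% of its verifiable spatial claims"); how QUACK actually decides verifiability is not specified here.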
Problem

Research questions and friction points this paper is trying to address.

social deduction
grounding
multimodal reasoning
language-action consistency
agent evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal social deduction
statement verification pipeline
language grounding audit
spatial hallucination detection
VLM evaluation framework
Ye Yuan
McGill University, Mila - Quebec AI Institute
Generative Modeling, Black Box Optimization, Knowledge-Centric NLP, LLMs
Rui Song
McGill University
Weien Li
McGill University
Zeyu Li
McGill University
Haochen Liu
University of Cambridge
Xiangyu Kong
McGill University, Mila - Quebec AI Institute
Changjiang Han
MBZUAI - Mohamed bin Zayed University of Artificial Intelligence
Yonghan Yang
MBZUAI - Mohamed bin Zayed University of Artificial Intelligence
Zichen Zhao
MBZUAI - Mohamed bin Zayed University of Artificial Intelligence
Zixuan Dong
New York University
Reinforcement Learning, Deep Learning, Neural Collapse
Fuyuan Lyu
McGill University / Mila - Quebec AI Institute
Data-Centric AI, Data Mining, LLM Evaluation, Inference Scaling
Bowei He
City University of Hong Kong, MBZUAI
Data Mining, Language Model, GenAI4Science, Agentic AI
Haolun Wu
Researcher at Mila, McGill, Stanford | Prev. intern at Google, DeepMind, MSR
Knowledge Representation, Information Retrieval, Human-centric AI
Jikun Kang
LMTS at Salesforce
Machine Learning, Reinforcement Learning
Xue Liu
MBZUAI - Mohamed bin Zayed University of Artificial Intelligence, McGill University, Mila - Quebec AI Institute