Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Detecting truthfulness in dynamic, multi-person social interactions remains a critical yet underexplored challenge in multimodal AI. Method: We introduce MIVA, the first benchmark task for multimodal veracity identification, together with a synchronized audio-video-text dataset grounded in the social deduction game *Werewolf* and annotated with utterance-level ground-truth labels. We conduct the first systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on authentic social reasoning, using end-to-end, fine-grained assessment. Contribution/Results: Our analysis uncovers a significant gap between visual social-cue localization and linguistic reasoning in MLLMs, as well as an over-conservative alignment bias toward defaulting to "truthful" predictions. Even top-performing models (e.g., GPT-4o) fall substantially short of practical utility, exposing fundamental limitations in social perception and trustworthy multimodal inference. This work contributes a new benchmark, dataset, and empirical insights for advancing trustworthy multimodal human-AI interaction.

📝 Abstract
As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Yet their capabilities in this crucial domain remain largely unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video and text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
Problem

Research questions and friction points this paper is trying to address.

Detecting deception in multi-party conversations using multimodal cues
Evaluating MLLMs' capability to discern truth from falsehood in social interactions
Addressing performance gaps in grounding language with visual social cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset from Werewolf game
Benchmark evaluating MLLMs on deception detection
Analyzing failure modes in multimodal grounding
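The over-conservative bias the analysis describes, models defaulting to "truthful" predictions, can be surfaced with a simple utterance-level scoring sketch. This is a hedged illustration, not the paper's actual evaluation code: the label names and sample data below are hypothetical, not drawn from the MIVA dataset.

```python
from collections import Counter

def evaluate_veracity(gold, pred):
    """Score utterance-level truth/lie predictions.

    gold, pred: equal-length lists of labels, each "truthful" or "deceptive".
    Returns overall accuracy plus the fraction of "truthful" predictions,
    which exposes an over-conservative bias toward the majority label.
    """
    assert len(gold) == len(pred) and len(gold) > 0
    correct = sum(g == p for g, p in zip(gold, pred))
    counts = Counter(pred)
    return {
        "accuracy": correct / len(gold),
        "truthful_rate": counts["truthful"] / len(pred),
    }

# Hypothetical predictions for five utterances: a model that always
# answers "truthful" still scores 0.6 accuracy here, while its
# truthful_rate of 1.0 reveals it never detects deception.
gold = ["truthful", "deceptive", "truthful", "deceptive", "truthful"]
pred = ["truthful", "truthful", "truthful", "truthful", "truthful"]
print(evaluate_veracity(gold, pred))  # → {'accuracy': 0.6, 'truthful_rate': 1.0}
```

Reporting the predicted-label distribution alongside accuracy is what distinguishes a genuinely perceptive model from one that merely exploits the base rate of honest statements.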