Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation in current evaluations of multimodal large language models (MLLMs) for personality assessment: they score only the prediction of Big Five trait ratings and cannot tell whether a judgment stems from genuine behavioral understanding or from superficial cues. The authors formalize the task of Grounded Personality Reasoning (GPR), which requires models to anchor each trait rating in observable behavioral evidence. They introduce MM-OCEAN, a multimodal dataset of 1,104 videos and 5,320 multiple-choice questions with timestamped behavioral observations. Using a three-tier evaluation (rating, reasoning, and grounding) and four sample-level failure-mode metrics, experiments across 27 MLLMs show that 51% of correct ratings are not grounded in retrieved cues and that the Holistic-Grounding Rate ranges only from 0% to 33.5%. The results expose a "right score, wrong reason" pattern and point to a gap in current systems' capacity for grounded personality inference.

📝 Abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
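As a rough illustration of how sample-level statistics of this kind could be computed, the sketch below tallies per-sample outcomes across the three tiers. The abstract does not give the formal definitions of PR, CR, IR, or HR, so everything here is an assumption: the `SampleResult` record, its boolean fields, and both functions are hypothetical and are not the authors' metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Hypothetical per-sample outcome for one model on one video/trait."""
    rating_correct: bool     # tier 1: rating counted as correct (criterion assumed)
    reasoning_correct: bool  # tier 2: reasoning judged acceptable (criterion assumed)
    grounding_correct: bool  # tier 3: cue-grounding answered correctly (criterion assumed)


def ungrounded_share_of_correct(results: List[SampleResult]) -> float:
    """Fraction of correctly rated samples whose grounding tier failed.

    Assumed reading of "correct ratings not grounded in retrieved cues";
    the paper's own definition may differ.
    """
    correct = [r for r in results if r.rating_correct]
    if not correct:
        return 0.0
    ungrounded = sum(1 for r in correct if not r.grounding_correct)
    return ungrounded / len(correct)


def all_tiers_rate(results: List[SampleResult]) -> float:
    """Fraction of all samples that pass rating, reasoning, and grounding together.

    An assumed stand-in for a holistic pass rate, not the paper's HR formula.
    """
    if not results:
        return 0.0
    passed = sum(
        1
        for r in results
        if r.rating_correct and r.reasoning_correct and r.grounding_correct
    )
    return passed / len(results)


# Toy data for illustration only; these are not results from the paper.
demo = [
    SampleResult(True, True, True),
    SampleResult(True, False, False),
    SampleResult(True, True, False),
    SampleResult(False, False, True),
]
```

On the toy data, two of the three correctly rated samples fail grounding and one of four samples passes all tiers, which shows how a model can look accurate on ratings while rarely succeeding on every tier at once.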

Problem

Research questions and friction points this paper is trying to address.

personality perception

Multimodal Large Language Models

bias

behavioral understanding

Big Five

Innovation

Methods, ideas, or system contributions that make the work stand out.

Grounded Personality Reasoning

MM-OCEAN dataset

multimodal large language models