PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing resources rarely combine behavioral descriptors with facial images and biographical information at scale, which hinders cross-modal modeling of human behavioral traits. Method: The authors introduce PersonaX, a collection of two multimodal datasets: CelebPersona (9444 public figures from diverse occupations) and AthlePersona (4181 professional athletes across 7 major sports leagues). Each dataset pairs behavioral trait assessments inferred by three large language models with facial imagery and structured biographical features. The analysis runs at two levels: five statistical independence tests examine how high-level trait scores relate to the other modalities, and a novel causal representation learning (CRL) framework for multimodal, multi-measurement data provides theoretical identifiability guarantees. Contribution/Results: Experiments on synthetic and real-world data demonstrate the effectiveness of the approach. PersonaX provides a foundation for studying LLM-inferred behavioral traits together with visual and biographical attributes.

📝 Abstract
Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9444 public figures from diverse occupations, and (2) AthlePersona, covering 4181 professional athletes across 7 major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features. We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning.
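The abstract does not name the five independence tests it applies, so the following is only a minimal sketch of the general idea: checking whether a trait score and a feature from another modality are statistically independent. It uses a permutation test on the Pearson correlation, which is an assumed stand-in that detects linear dependence only. All function names, variable names, and the synthetic data are hypothetical and are not taken from PersonaX.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant sequence carries no linear association
    return sxy / (sxx * syy) ** 0.5


def permutation_independence_test(x, y, n_permutations=1000, seed=0):
    """Return (|r|, p-value) for the null hypothesis that x and y are independent.

    Shuffling y breaks any pairing with x, so the shuffled statistics
    approximate the distribution of |r| under independence.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic stand-ins (hypothetical): one numeric biographical feature,
# one trait score that depends on it, and one that does not.
data_rng = random.Random(42)
feature = [data_rng.gauss(0, 1) for _ in range(200)]
dependent_trait = [0.8 * f + data_rng.gauss(0, 0.5) for f in feature]
unrelated_trait = [data_rng.gauss(0, 1) for _ in range(200)]

r_dep, p_dep = permutation_independence_test(feature, dependent_trait)
r_ind, p_ind = permutation_independence_test(feature, unrelated_trait)
print(f"dependent trait: |r|={r_dep:.3f}, p={p_dep:.4f}")
print(f"unrelated trait: |r|={r_ind:.3f}, p={p_ind:.4f}")
```

A small p-value is evidence against independence. Nonlinear or categorical relationships would need a different statistic, such as a kernel-based or chi-square test.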
Problem

Research questions and friction points this paper is trying to address.

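Existing resources rarely combine behavioral trait descriptors with facial attributes and biographical information
How LLM-inferred trait scores relate to facial and biographical features is largely unexamined
Multimodal, multi-measurement trait analysis lacks a causal representation learning framework with identifiability guarantees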
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLM-inferred behavioral trait assessments
Integrates facial imagery with biographical features
Introduces causal representation learning framework
Loka Li
Mohamed bin Zayed University of Artificial Intelligence
Machine Learning, Causality
Wong Yu Kang
Mohamed bin Zayed University of Artificial Intelligence
Minghao Fu
University of California San Diego
Guangyi Chen
Mohamed bin Zayed University of Artificial Intelligence, Carnegie Mellon University
Zhenhao Chen
MBZUAI
Causality, Machine Learning, Representation Learning, LLM, Multimodal AI
Gongxu Luo
Mohamed bin Zayed University of Artificial Intelligence
Yuewen Sun
Mohamed bin Zayed University of Artificial Intelligence
Causality, Reinforcement learning, Representation Learning
Salman Khan
Mohamed bin Zayed University of Artificial Intelligence, Australian National University
Peter Spirtes
Professor of Philosophy, Carnegie Mellon University
Machine Learning, Causal Inference
Kun Zhang
Mohamed bin Zayed University of Artificial Intelligence, Carnegie Mellon University