🤖 AI Summary
This study systematically investigates how modality type (text, image, text-image fusion, and typographic image) affects personality embodiment in multimodal large language models (MLLMs). To this end, we construct the first parallel dataset covering 40 distinct personas with four-modality alignment, and design a fine-grained, 60-item human evaluation framework. We introduce *typographic images* as a novel modality and conduct the first cross-modal quantitative analysis of personality consistency. Experimental results reveal that detailed textual descriptions better elicit language-style traits, typographic images significantly enhance personality consistency, and mainstream MLLMs (e.g., LLaVA, Qwen-VL) generally fail to perceive personality-relevant visual details. All data, code, and evaluation tools are publicly released to support future research on personality modeling and multimodal alignment.
📝 Abstract
Large language models (LLMs) have recently demonstrated remarkable advancements in embodying diverse personas, enhancing their effectiveness as conversational agents and virtual assistants. In parallel, LLMs have made significant strides in processing and integrating multimodal information. However, even though human personas can be expressed in both text and images, the extent to which the modality of a persona impacts its embodiment by the LLM remains largely unexplored. In this paper, we investigate how different modalities influence the expressiveness of personas in multimodal LLMs. To this end, we create a novel modality-parallel dataset of 40 diverse personas varying in age, gender, occupation, and location. The dataset comprises four modalities that equivalently represent each persona: image-only, text-only, a combination of image and small text, and typographical images, where text is visually stylized to convey persona-related attributes. We then create a systematic evaluation framework with 60 questions and corresponding metrics to assess how well LLMs embody each persona across its attributes and scenarios. Comprehensive experiments on 5 multimodal LLMs show that personas represented by detailed text exhibit more linguistic habits, while typographical images often show more consistency with the persona. Our results reveal that LLMs often overlook persona-specific details conveyed through images, highlighting underlying limitations and paving the way for future research to bridge this gap. We release the data and code at https://github.com/claws-lab/persona-modality.