🤖 AI Summary
This study proposes a novel approach to evaluating personality-like traits in large multimodal models (LMMs) by focusing on their implicit cognitive and affective functions through nonverbal modalities. For the first time, the Thematic Apperception Test (TAT) from clinical psychology is combined with the Social Cognition and Object Relations Scale–Global Rating (SCORS-G) to assess LMMs: TAT images prompt one LMM to generate a narrative, which another LMM then scores using SCORS-G criteria, establishing a generative–evaluative dual-agent framework. This method enables cross-modal, nonverbal assessment of personality-like functioning, and the evaluator scores are highly consistent with human expert ratings. The findings reveal that LMMs generally excel at understanding interpersonal dynamics and self-concept but exhibit systematic deficits in recognizing and regulating aggression, with performance improving consistently with model scale and across version updates.
📝 Abstract
The Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. It is a projective psychological test designed to uncover unconscious aspects of personality. This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the Social Cognition and Object Relations Scale–Global (SCORS-G). LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), which assess these narratives using the SCORS-G framework. The evaluator models demonstrated an excellent ability to understand and analyze TAT responses, and their interpretations were highly consistent with those of human experts. The assessment results show that all models understand interpersonal dynamics very well and have a good grasp of the concept of self. However, they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models consistently outperforming smaller and earlier ones across SCORS-G dimensions.
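The subject-model and evaluator-model roles described above can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the names `SubjectModel`, `EvaluatorModel`, `validate_ratings`, and `assess` are hypothetical, the models are stand-in callables rather than a real LMM API, and averaging across cards is an assumed aggregation. The eight dimension abbreviations and the 1 to 7 rating range come from the published SCORS-G scale, not from this abstract.

```python
from statistics import mean
from typing import Callable, Dict, List

# The eight SCORS-G dimensions, each rated 1-7 (background knowledge about
# the scale; the abstract itself does not list them).
SCORS_G_DIMENSIONS = ["COM", "AFF", "EIR", "EIM", "SC", "AGG", "SE", "ICS"]

# Hypothetical interfaces: a subject model maps a TAT image reference to a
# story, and an evaluator model maps a story to one rating per dimension.
SubjectModel = Callable[[str], str]
EvaluatorModel = Callable[[str], Dict[str, int]]


def validate_ratings(ratings: Dict[str, int]) -> Dict[str, int]:
    """Check that every dimension is present and lies in the 1-7 range."""
    missing = [d for d in SCORS_G_DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing SCORS-G dimensions: {missing}")
    for dim in SCORS_G_DIMENSIONS:
        if not 1 <= ratings[dim] <= 7:
            raise ValueError(f"{dim} rating {ratings[dim]} is outside 1-7")
    return {dim: ratings[dim] for dim in SCORS_G_DIMENSIONS}


def assess(
    subject: SubjectModel,
    evaluator: EvaluatorModel,
    image_ids: List[str],
) -> Dict[str, float]:
    """Run the two-role loop and average each dimension over all cards."""
    per_card = []
    for image_id in image_ids:
        story = subject(image_id)      # subject model narrates the image
        ratings = evaluator(story)     # evaluator model scores the story
        per_card.append(validate_ratings(ratings))
    # Assumed aggregation: the mean rating per dimension across cards.
    return {dim: mean(r[dim] for r in per_card) for dim in SCORS_G_DIMENSIONS}
```

In practice, `subject` and `evaluator` would wrap calls to two LMMs, and the evaluator's per-dimension ratings could be compared against human expert ratings of the same stories to check agreement.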