Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis

📅 2025-10-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the subjectivity and expert dependence of drawing-based psychological analysis (e.g., the House-Tree-Person test) by proposing PICK, a multi-step framework that pairs a multimodal large language model (MLLM) with an HTP knowledge base and a feature extraction module trained with reinforcement learning. PICK decomposes a drawing into sub-drawings, analyzes them at three levels (single-object, multi-object, and whole), and integrates the results into an interpretable assessment that links local visual cues to psychological states. Contributions include: (1) an MLLM-based pipeline whose assessments align with expert-level reasoning; (2) a structured combination of psychological knowledge and learned visual features, covering both holistic stylistic features and object-specific features; and (3) an extension to emotion understanding tasks that supports the framework's generality. Experiments show that PICK significantly enhances the capability of MLLMs in psychological analysis.

📝 Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
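
The three-level decomposition described in the abstract can be pictured with a small data-structure sketch. This is an illustration only, not the authors' implementation: the `SubDrawing` class, the `build_hierarchy` function, the use of bounding boxes, and the choice to form the multi-object level from pairs of objects are all assumptions made here for clarity, and the example coordinates are invented.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class SubDrawing:
    """One region of the drawing, tagged with its level in the hierarchy."""
    level: str                 # "single", "multi", or "whole"
    labels: Tuple[str, ...]    # objects contained in this region
    box: Box


def union_box(a: Box, b: Box) -> Box:
    """Smallest box that encloses both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def build_hierarchy(objects: Dict[str, Box], canvas: Box) -> List[SubDrawing]:
    """Arrange detected objects into single-object, multi-object, and whole levels.

    Assumption: the multi-object level is approximated by all object pairs.
    """
    subs = [SubDrawing("single", (name,), box) for name, box in objects.items()]
    for (n1, b1), (n2, b2) in combinations(objects.items(), 2):
        subs.append(SubDrawing("multi", (n1, n2), union_box(b1, b2)))
    subs.append(SubDrawing("whole", tuple(objects), canvas))
    return subs


# Invented example: one house, one tree, and one person on a 160x100 canvas.
objects = {
    "house": (10, 40, 60, 90),
    "tree": (70, 20, 110, 95),
    "person": (120, 50, 140, 95),
}
for sub in build_hierarchy(objects, canvas=(0, 0, 160, 100)):
    print(sub.level, sub.labels, sub.box)
```

In a full pipeline, each `SubDrawing` would presumably be cropped and passed to an MLLM with a level-specific prompt before the per-level findings are merged; that step is omitted here because the abstract does not specify its interface.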

Problem

Research questions and friction points this paper is trying to address.

Applying MLLMs to subjective psychological analysis tasks

Interpreting drawings for psychoanalysis using hierarchical decomposition

Generating expert-level psychological assessments from visual cues

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical decomposition of drawings into semantic sub-components

Feature extraction trained with reinforcement learning for psychological profiling

Integration of expert knowledge base with multimodal analysis

🔎 Similar Papers

Aligning Large Language Models for Enhancing Psychiatric Interviews Through Symptom Delineation and Summarization: Pilot Study