Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generative psychoanalysis in natural dialogue faces two core challenges: (1) visual-language models struggle to disambiguate speech-related articulatory motions from affective facial expressions—termed *articulation–affect ambiguity*; and (2) the absence of a verifiable, fine-grained evaluation framework. To address these, we propose MIND, a hierarchical visual encoder featuring a *state-judgment module* that suppresses lip-motion interference to achieve visual decoupling of linguistic and affective features. We further introduce ConvoInsight-DB, a novel dialogue dataset with micro-expression annotations, and PRISM—an automated evaluation framework integrating micro-expression labeling, temporal variance analysis, and expert-guided large-model scoring. On the PRISM benchmark, our model achieves an 86.95% improvement in micro-expression detection over prior SOTA. Ablation studies confirm the state-judgment module as the key architectural innovation.

📝 Abstract
Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce the Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder with a Status Judgment module that algorithmically suppresses ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over the prior SOTA. Ablation studies confirm that the Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been open-sourced.
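The abstract describes the Status Judgment module as suppressing ambiguous lip features according to their temporal feature variance. The paper's actual architecture is not reproduced here; the following is a minimal illustrative sketch of that idea, assuming per-frame feature vectors and a hypothetical `lip_idx` set of dimensions taken to encode articulation. The threshold value and the hard 0/1 gate are assumptions for illustration, not the authors' design.

```python
import numpy as np

def status_judgment_gate(features, lip_idx, var_threshold=0.5):
    """Illustrative variance-based suppression of lip-motion features.

    features: (T, D) array of per-frame visual features.
    lip_idx: indices of dimensions assumed to carry lip/articulation motion.
    Lip dimensions whose temporal variance exceeds `var_threshold` are
    treated as speech-driven articulation and zeroed out, while affective
    (non-lip) dimensions pass through unchanged.
    """
    gate = np.ones(features.shape[1])
    lip_var = features[:, lip_idx].var(axis=0)
    gate[lip_idx] = np.where(lip_var > var_threshold, 0.0, 1.0)
    return features * gate
```

In this sketch, high temporal variance in a lip dimension is read as talking rather than emoting, so that dimension is masked before downstream reasoning; a learned soft gate would be a natural refinement.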
Problem

Research questions and friction points this paper is trying to address.

Resolves articulatory-affective ambiguity in visual speech patterns
Addresses lack of verifiable metrics for visual grounding and reasoning
Enables psychological analysis in wild conversations via disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical visual encoder for lip feature disentanglement
Large-scale dataset with expert psychological annotations
Automated metric using LLM for multidimensional evaluation
Yigui Feng
College of Computer Science, National University of Defense Technology
Qinglin Wang
National University of Defense Technology
Haotian Mo
College of Computer Science, National University of Defense Technology
Yang Liu
Shien-Ming Wu School of Intelligent Engineering, South China University of Technology
Ke Liu
College of Computer Science, National University of Defense Technology
Gencheng Liu
College of Computer Science, National University of Defense Technology
Xinhai Chen
College of Computer Science, National University of Defense Technology
Siqi Shen
School of Informatics, Xiamen University
Songzhu Mei
College of Computer Science, National University of Defense Technology
Jie Liu
College of Computer Science, National University of Defense Technology