🤖 AI Summary
Generative psychological analysis of natural dialogue faces two core challenges: (1) vision-language models struggle to disambiguate speech-related articulatory motions from affective facial expressions, termed *articulatory-affective ambiguity*; and (2) the field lacks a verifiable, fine-grained evaluation framework. To address these, we propose MIND, a hierarchical visual encoder featuring a *Status Judgment* module that suppresses lip-motion interference to disentangle linguistic and affective visual features. We further introduce ConvoInsight-DB, a novel dialogue dataset with expert micro-expression annotations, and PRISM, an automated evaluation framework integrating micro-expression labeling, temporal variance analysis, and expert-guided large-model scoring. On the PRISM benchmark, MIND improves micro-expression detection by 86.95% over the prior SOTA. Ablation studies confirm that the Status Judgment module is the key architectural innovation.
📝 Abstract
Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where the visual patterns of speech mimic emotional expressions; and (2) progress is stifled by the lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce the Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder whose Status Judgment module algorithmically suppresses ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental-vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over the prior SOTA. Ablation studies confirm that the Status Judgment disentanglement module is the most critical component of this performance leap. Our code has been open-sourced.
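The core idea behind the Status Judgment module, suppressing lip features whose temporal variance indicates speech articulation rather than affect, can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the paper's implementation: the function name, the soft exponential gate, and the variance threshold are all hypothetical choices used only to make the mechanism concrete.

```python
import numpy as np

def status_judgment_gate(lip_feats, var_threshold=0.1):
    """Toy sketch (not the paper's code): attenuate lip-feature
    dimensions whose temporal variance suggests speech articulation.

    lip_feats: (T, D) array of per-frame lip-region features.
    Returns gated features of the same shape.
    """
    # Per-dimension variance across the time axis.
    var = lip_feats.var(axis=0)  # shape (D,)
    # Soft gate in (0, 1]: dimensions whose variance exceeds the
    # threshold (likely articulation) are pushed toward zero;
    # low-variance (affective) dimensions pass through unchanged.
    gate = np.exp(-np.maximum(var - var_threshold, 0.0))
    return lip_feats * gate

# Demo: a near-static affective channel vs. a rapidly varying speech channel.
T = 32
static = np.full((T, 1), 0.8)                     # low variance -> preserved
speech = np.sin(np.linspace(0, 20, T))[:, None]   # high variance -> suppressed
feats = np.concatenate([static, speech], axis=1)
gated = status_judgment_gate(feats)
```

The gate here is per-dimension and purely statistical; in a real encoder the same variance signal would more plausibly modulate attention weights or a learned masking head.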