Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based mental health dialogue research prioritizes diagnostic accuracy while neglecting alignment with patients' goals, values, and personality traits. Method: The authors propose MedAgent, a framework for synthesizing realistic, multi-turn psychotherapeutic dialogues, and use it to build the Mental Health Sensemaking Dialogue (MHSD) dataset of over 2,200 patient-LLM conversations. They also introduce MultiSenseEval, a human-centered, multidimensional evaluation framework targeting sensemaking, encompassing goal consistency, value alignment, and persona-aware empathy. Contribution/Results: Experiments reveal that state-of-the-art reasoning models achieve only 31% average performance on patient-centered communication; performance degrades across dialogue turns and varies with the patient's personality traits. The work establishes a benchmark and methodological foundation for evaluating LLMs in long-horizon, person-centered mental health dialogues.

📝 Abstract
Limited access to mental healthcare, extended wait times, and the increasing capabilities of Large Language Models (LLMs) have led individuals to turn to LLMs to meet their mental health needs. However, the multi-turn mental health conversation capabilities of LLMs remain under-explored. Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates, and often overlook the alignment with patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations, and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient-LLM conversations. Additionally, we present MultiSenseEval, a holistic framework for evaluating the multi-turn conversation abilities of LLMs in healthcare settings using human-centric criteria. Our findings reveal that frontier reasoning models yield below-par performance on patient-centric communication and struggle with advanced diagnostic capabilities, achieving an average score of 31%. We also observed variation in model performance based on the patient's persona, and a drop in performance as the number of conversation turns increases. Our work provides a comprehensive synthetic data generation framework, a dataset, and an evaluation framework for assessing LLMs in multi-turn mental health conversations.
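To make the evaluation setup concrete, the abstract's per-turn, persona-aware scoring can be sketched as follows. This is a minimal illustration, not the paper's actual MultiSenseEval implementation: the `Turn`/`Persona` structures, the dimension names, and the `judge` callable (a stand-in for an LLM-as-judge call) are all assumptions for exposition.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    patient: str  # what the simulated patient said
    model: str    # the LLM's reply being evaluated

@dataclass
class Persona:
    traits: dict = field(default_factory=dict)  # e.g. {"neuroticism": 0.8}

# Hypothetical human-centric dimensions, mirroring those named in the summary.
DIMENSIONS = ("goal_consistency", "value_alignment", "empathy")

def score_turn(turn: Turn, persona: Persona, judge) -> dict:
    """Score one model reply on each dimension in [0, 1].

    `judge` is any callable (turn, persona, dimension) -> float; in a real
    pipeline it would wrap an LLM-as-judge prompt.
    """
    return {dim: judge(turn, persona, dim) for dim in DIMENSIONS}

def per_turn_averages(dialogue: list, persona: Persona, judge) -> list:
    """Average the dimension scores at each turn index, so degradation
    across the conversation (as the paper reports) becomes visible."""
    averages = []
    for turn in dialogue:
        scores = score_turn(turn, persona, judge)
        averages.append(sum(scores.values()) / len(scores))
    return averages
```

Plotting `per_turn_averages` over a long dialogue is one simple way to surface the turn-by-turn performance drop the authors describe.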
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for multi-turn mental health conversation capabilities
Addressing gaps in patient-specific goal alignment in existing frameworks
Assessing LLM performance in advanced diagnostic and patient-centric communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MedAgent for synthetic mental health conversations
Creates the MHSD dataset with over 2,200 patient-LLM dialogues
Proposes MultiSenseEval for human-centric LLM evaluation