MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

Date: 2026-05-14

AI Summary
This study evaluates the capacity of large language models (LLMs) to comprehend biomedical knowledge and perform structured clinical reasoning in the domain of mental health. To this end, the authors construct the first mental-health-focused subgraph derived from PrimeKG and introduce a multitask evaluation framework supporting named entity recognition, relation classification, and two-hop reasoning, enhanced with controlled negative samples to ensure assessment rigor. Experiments across 15 prominent LLMs reveal that while models approach performance ceilings in entity recognition, they exhibit substantial deficiencies in relation prediction and multi-hop reasoning. Furthermore, contextual integration of knowledge graphs benefits some models but hinders others, and output formatting critically influences evaluation outcomes. This work establishes the first structured, extensible benchmark for assessing LLMs in mental health applications.
Abstract
Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.
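The abstract's point about output-format reliability can be illustrated with a small scoring sketch. This is not the authors' scoring protocol: the strict parsing rule, the function names `parse_choice` and `score`, and the toy responses are all assumptions made for illustration. It shows how a response that names the right option in free text can still count as wrong when only a bare option letter is accepted.

```python
import re

def parse_choice(response, n_options):
    """Strict parser (an assumed rule): accept only a single option letter,
    optionally followed by '.' or ')'. Anything else is treated as invalid."""
    m = re.fullmatch(r"\s*([A-Za-z])[.)]?\s*", response)
    if not m:
        return None
    idx = ord(m.group(1).upper()) - ord("A")
    return idx if idx < n_options else None

def score(responses, gold, n_options):
    """Report accuracy over all items, the share of valid outputs,
    and accuracy restricted to the valid outputs."""
    parsed = [parse_choice(r, n_options) for r in responses]
    n = len(gold)
    n_valid = sum(p is not None for p in parsed)
    n_correct = sum(p == g for p, g in zip(parsed, gold))
    return {
        "strict_accuracy": n_correct / n,
        "valid_rate": n_valid / n,
        "accuracy_on_valid": n_correct / n_valid if n_valid else 0.0,
    }

# Hypothetical model outputs for three four-option items; gold indices are 0-based.
responses = ["A", "The answer is B", "c."]
gold = [0, 1, 2]
result = score(responses, gold, n_options=4)
```

Here the second response points to the correct option but fails the strict format check, so the accuracy over all items is lower than the accuracy over valid outputs. Reporting both numbers alongside the valid-output rate is one way to separate knowledge errors from formatting errors.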
Problem

Research questions and friction points this paper is trying to address.

mental health
large language models
knowledge graph
benchmarking
biomedical knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge graph
mental health benchmarking
large language models
two-hop reasoning
structured evaluation
Weixin Liu
Baidu Inc.
Natural Language Processing, Machine Learning, Deep Learning
Congning Ni
Vanderbilt University Medical Center, Nashville, TN, USA
Shelagh A. Mulvaney
Vanderbilt University
digital health interventions, self-management, diabetes, social learning, momentary assessment
Susannah L. Rose
Vanderbilt University Medical Center, Nashville, TN, USA
Murat Kantarcioglu
Professor of Computer Science, Virginia Tech
Security and Privacy in AI, Databases, Data Science, Computer Security
Bradley A. Malin
Vanderbilt University, Nashville, TN, USA; Vanderbilt University Medical Center, Nashville, TN, USA
Zhijun Yin
Vanderbilt University, Nashville, TN, USA; Vanderbilt University Medical Center, Nashville, TN, USA