Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the clinical applicability of large language models (LLMs) in identifying psychiatric adverse drug reactions (ADRs) and generating harm-reduction strategies. To address the lack of domain-specific evaluation resources, the authors introduce Psych-ADR, a manually annotated benchmark focused exclusively on psychiatric ADRs, and propose ADRA, a strategy-driven evaluation framework that quantifies model performance along four dimensions: ADR detection accuracy, emotion/tone consistency, clinical actionability of generated strategies, and alignment with expert recommendations. Experimental results reveal clear limitations: LLM responses are more complex and harder to read than expert answers, align with expert strategies only 70.86% of the time, and offer advice that is on average 12.32% less actionable. The work provides the first open benchmark and methodology for evaluating LLMs on psychiatric ADR management, and a template for assessing expert alignment in other high-stakes healthcare applications.

📝 Abstract
Adverse Drug Reactions (ADRs) from psychiatric medications are the leading cause of hospitalizations among mental health patients. With healthcare systems and online communities facing limitations in resolving ADR-related issues, Large Language Models (LLMs) have the potential to fill this gap. Despite the increasing capabilities of LLMs, past research has not examined their ability to detect ADRs related to psychiatric medications or to provide effective harm reduction strategies. To address this, we introduce the Psych-ADR benchmark and the Adverse Drug Reaction Response Assessment (ADRA) framework to systematically evaluate LLM performance in detecting ADR expressions and delivering expert-aligned mitigation strategies. Our analyses show that LLMs struggle with understanding the nuances of ADRs and differentiating between types of ADRs. While LLMs align with experts in terms of expressed emotions and tone of the text, their responses are more complex, harder to read, and only 70.86% aligned with expert strategies. Furthermore, they provide less actionable advice by a margin of 12.32% on average. Our work provides a comprehensive benchmark and evaluation framework for assessing LLMs in strategy-driven tasks within high-risk domains.
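The aggregate figures above (e.g. the 70.86% expert-alignment rate and the readability gap) can be illustrated with a minimal sketch. Note this is purely hypothetical: the function names, the binary per-response judgments, and the words-per-sentence readability proxy are assumptions for illustration, not the paper's actual ADRA scoring procedure.

```python
# Hypothetical sketch of aggregate metrics of the kind reported in the abstract:
# an expert-alignment percentage over per-response judgments, plus a crude
# readability proxy for comparing LLM and expert answers. All names and data
# here are illustrative assumptions, not the ADRA framework itself.

def alignment_rate(judgments):
    """Percentage of model strategies judged aligned with the expert strategy (1/0 labels)."""
    return 100.0 * sum(judgments) / len(judgments)

def avg_words_per_sentence(text):
    """Crude readability proxy: longer sentences tend to be harder to read."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    words = sum(len(s.split()) for s in sentences)
    return words / len(sentences)

# Toy paired judgments: 1 = aligned with the expert strategy, 0 = not.
judged = [1, 1, 0, 1, 0, 1, 1]
print(f"alignment: {alignment_rate(judged):.2f}%")

llm_reply = ("Consider consulting your prescriber about dose titration, "
             "because abrupt discontinuation can worsen symptoms.")
expert_reply = "Ask your prescriber before changing the dose."
# A longer average sentence length suggests the LLM reply is harder to read.
print(avg_words_per_sentence(llm_reply) > avg_words_per_sentence(expert_reply))
```

In practice a readability formula such as Flesch reading ease and an expert (or LLM-as-judge) rubric would replace these stand-ins, but the aggregation step is the same simple averaging shown here.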
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Psychopharmacological Side Effects
Expertise Comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Psych-ADR Testing
ADR Assessment Framework
Large Language Model Evaluation