Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the clinical applicability of large language models (LLMs) in identifying psychiatric adverse drug reactions (ADRs) and generating harm-reduction strategies. To address the lack of domain-specific evaluation resources, the authors introduce Psych-ADR, a manually annotated benchmark focused exclusively on psychiatric ADRs, and propose ADRA, a strategy-driven evaluation framework that quantifies model performance along four dimensions: ADR detection accuracy, emotion/tone consistency, clinical actionability of generated strategies, and alignment with expert recommendations. Experimental results reveal clear limitations: LLM responses are more complex and harder to read than expert answers, align with expert strategies only 70.86% of the time, and offer advice that is on average 12.32% less actionable. The work provides the first open benchmark and methodology for evaluating LLMs on psychiatric ADR management, and a template for assessing expert alignment in other high-stakes healthcare applications.

📝 Abstract
Adverse Drug Reactions (ADRs) from psychiatric medications are the leading cause of hospitalizations among mental health patients. With healthcare systems and online communities facing limitations in resolving ADR-related issues, Large Language Models (LLMs) have the potential to fill this gap. Despite the increasing capabilities of LLMs, past research has not examined their ability to detect ADRs related to psychiatric medications or to provide effective harm reduction strategies. To address this, we introduce the Psych-ADR benchmark and the Adverse Drug Reaction Response Assessment (ADRA) framework to systematically evaluate LLM performance in detecting ADR expressions and delivering expert-aligned mitigation strategies. Our analyses show that LLMs struggle with understanding the nuances of ADRs and differentiating between types of ADRs. While LLMs align with experts in terms of expressed emotions and tone of the text, their responses are more complex, harder to read, and only 70.86% aligned with expert strategies. Furthermore, they provide less actionable advice by a margin of 12.32% on average. Our work provides a comprehensive benchmark and evaluation framework for assessing LLMs in strategy-driven tasks within high-risk domains.
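The aggregate figures above (e.g. the 70.86% expert-alignment rate and the readability gap) can be illustrated with a minimal sketch. Note this is purely hypothetical: the function names, the binary per-response judgments, and the words-per-sentence readability proxy are assumptions for illustration, not the paper's actual ADRA scoring procedure.

```python
# Hypothetical sketch of aggregate metrics of the kind reported in the abstract:
# an expert-alignment percentage over per-response judgments, plus a crude
# readability proxy for comparing LLM and expert answers. All names and data
# here are illustrative assumptions, not the ADRA framework itself.

def alignment_rate(judgments):
    """Percentage of model strategies judged aligned with the expert strategy (1/0 labels)."""
    return 100.0 * sum(judgments) / len(judgments)

def avg_words_per_sentence(text):
    """Crude readability proxy: longer sentences tend to be harder to read."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    words = sum(len(s.split()) for s in sentences)
    return words / len(sentences)

# Toy paired judgments: 1 = aligned with the expert strategy, 0 = not.
judged = [1, 1, 0, 1, 0, 1, 1]
print(f"alignment: {alignment_rate(judged):.2f}%")

llm_reply = ("Consider consulting your prescriber about dose titration, "
             "because abrupt discontinuation can worsen symptoms.")
expert_reply = "Ask your prescriber before changing the dose."
# A longer average sentence length suggests the LLM reply is harder to read.
print(avg_words_per_sentence(llm_reply) > avg_words_per_sentence(expert_reply))
```

In practice a readability formula such as Flesch reading ease and an expert (or LLM-as-judge) rubric would replace these stand-ins, but the aggregation step is the same simple averaging shown here.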
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Psychopharmacological Side Effects
Expertise Comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Psych-ADR Testing
ADR Assessment Framework
Large Language Model Evaluation