🤖 AI Summary
Existing psychiatric assessment resources rely heavily on small-scale clinical corpora, social media data, or synthetic datasets, which limits their clinical validity and leaves complex tasks such as diagnostic reasoning and treatment planning unsupported. Method: We introduce PsychiatryBench, the first multi-task benchmark grounded in authoritative psychiatric textbooks and expert-validated casebooks, comprising 11 clinically meaningful question-answering tasks and over 5,300 expert-annotated instances. We propose a modular, scalable, textbook-driven evaluation framework and adopt an LLM-as-judge approach for automated, scalable scoring. Contribution/Results: Comprehensive evaluation reveals substantial deficiencies in clinical consistency and safety across mainstream large language models, especially in longitudinal follow-up and management decision-making tasks. These findings underscore the critical need for domain-specific model optimization and rigorous, clinically informed evaluation paradigms.
📝 Abstract
Large language models (LLMs) hold great promise for enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources rely heavily on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks, spanning diagnostic reasoning, treatment planning, longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended-matching formats, totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QwQ-32B) alongside leading open-source medical models (e.g., OpenBioLLM, MedGemma) using both conventional metrics and an "LLM-as-judge" similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications.
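The abstract does not spell out the judge rubric, so the following is a minimal sketch of how such an "LLM-as-judge" similarity scorer might look, assuming an OpenAI-compatible chat API. The prompt wording, the 1-5 clinical-agreement scale, the `judge_similarity` helper, and the `gpt-4o` judge model are all illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of an LLM-as-judge similarity scorer; not the
# benchmark's published implementation.
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-capable judge works

# Illustrative rubric: rate clinical agreement between a model answer and
# the expert reference on a 1-5 scale.
JUDGE_PROMPT = """\
You are an expert psychiatrist grading a model's answer against a reference.
Rate clinical agreement on a 1-5 scale (5 = fully consistent and safe,
1 = contradictory or clinically unsafe).

Question: {question}
Reference answer: {reference}
Model answer: {candidate}

Reply with only the integer score."""


def judge_similarity(question: str, reference: str, candidate: str,
                     judge_model: str = "gpt-4o") -> int:
    """Score a candidate answer against the expert reference via an LLM judge."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    reply = resp.choices[0].message.content or ""
    match = re.search(r"[1-5]", reply)  # take the first valid score digit
    if match is None:
        raise ValueError(f"Judge returned no parsable score: {reply!r}")
    return int(match.group())
```

Averaging such per-item scores over each of the eleven tasks would yield the kind of task-level consistency comparison the abstract reports; a production harness would also need rate limiting, retries, and a held-out calibration of the judge against human expert ratings.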