Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare

📅 2025-02-22
🤖 AI Summary
Existing medical language model (LM) benchmarks rely predominantly on multiple-choice questions and fail to capture the ambiguity and complexity inherent in real-world clinical decision-making. Method: We introduce the first realistic mental health decision-making dataset fully annotated by psychiatrists, without any LLM involvement, covering five core clinical tasks: treatment, diagnosis, documentation, monitoring, and triage. The dataset explicitly models clinical uncertainty, solution multiplicity, and controllable demographic de-identification (e.g., age). It incorporates expert-annotated uncertainty labels in a preference subset and adopts multi-option decision formats alongside preference-learning objectives. Contribution/Results: We evaluate 11 general-purpose and 4 mental-health-finetuned LMs in zero-shot and fine-tuning settings. Our work is the first to quantify demographic bias effects and to measure the consistency gap between free-form LM outputs and expert judgments, establishing a new clinical LM evaluation benchmark.

📝 Abstract
Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This dataset, created without any LM assistance, is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all 203 base questions, with five answer options each, have had the decision-irrelevant demographic patient information removed and replaced with variables (e.g., AGE), and are available for male, female, or non-binary-coded patients. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating eleven off-the-shelf and four mental health fine-tuned LMs on category-specific task accuracy, on the impact of patient demographic information on decision-making, and on how consistently free-form responses deviate from human-annotated samples.
Problem


Addresses limitations in existing medical language model benchmarks
Introduces a clinician-annotated dataset for mental healthcare tasks
Captures clinical reasoning and ambiguities in care delivery
Innovation


Expert-annotated mental healthcare dataset
Captures clinical reasoning and ambiguities
Evaluates language models on real tasks
Max Lamparth
Research Fellow, Stanford University
Machine Learning · Uncertainty Quantification · Interpretability · AI Safety · Responsible AI
D. Grabb
Stanford University
Amy Franks
University of Colorado
Scott Gershan
Northwestern University
Kaitlyn N. Kunstman
Northwestern University
Aaron Lulla
Stanford University
Monika Drummond Roots
University of Wisconsin
Manu Sharma
Yale University
Aryan Shrivastava
University of Chicago
Nina Vasan
Stanford University
Colleen Waickman
Ohio State University