Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In AI safety evaluations for mental health applications, expert feedback is often treated as ground truth, yet its reliability remains questionable. This study engaged three board-certified psychiatrists to independently assess large language model–generated mental health responses using standardized scales. Inter-rater reliability was quantified via intraclass correlation coefficients (ICC) and Krippendorff's alpha, complemented by qualitative interviews and calibrated rating instruments to explore sources of disagreement. Results revealed extremely low inter-rater reliability (ICC: 0.087–0.295), including negative agreement (Krippendorff's α = −0.203) on critical safety items such as suicide and self-harm content. These discrepancies were not random but stemmed from systematic differences in clinical philosophy (prioritizing safety, promoting client engagement, or emphasizing cultural sensitivity), challenging conventional evaluation paradigms that rely on aggregated labels and revealing deep diversity in professional judgment.
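As a rough illustration of the reliability statistic reported above, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater, following Shrout and Fleiss) for three raters. The ratings, scale, and item count are invented for illustration and are not the study's data.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_items, n_raters) matrix with no missing values.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1, keepdims=True)  # per-item means
    col_means = ratings.mean(axis=0, keepdims=True)  # per-rater means

    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)  # between items
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)  # between raters
    mse = np.sum((ratings - row_means - col_means + grand) ** 2) / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical 1-5 safety ratings: 3 psychiatrists scoring 8 model responses,
# with deliberately divergent judgments to mimic the paper's setting.
scores = np.array([
    [5, 4, 2],
    [4, 5, 1],
    [2, 3, 5],
    [5, 2, 3],
    [3, 5, 2],
    [1, 4, 4],
    [4, 1, 5],
    [2, 5, 1],
], dtype=float)
print(f"ICC(2,1) = {icc2_1(scores):.3f}")  # expected to be low for divergent raters
```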

📝 Abstract
Learning from human feedback (LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC 0.087–0.295), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items: suicide and self-harm responses produced greater divergence than any other category, and that divergence was systematic rather than random. One factor yielded negative reliability (Krippendorff's α = −0.203), indicating structured disagreement worse than chance. Qualitative interviews revealed that disagreement reflects coherent but incompatible clinical frameworks (safety-first, engagement-centered, and culturally informed orientations) rather than measurement error. By demonstrating that experts rely on holistic risk heuristics rather than granular factor discrimination, these findings suggest that aggregated labels function as arithmetic compromises that erase grounded professional philosophies. Our results characterize expert disagreement in safety-critical AI as a sociotechnical phenomenon in which professional experience introduces layers of principled divergence. We discuss implications for reward modeling, safety classification, and evaluation benchmarks, recommending that practitioners shift from consensus-based aggregation to alignment methods that preserve and learn from expert disagreement.
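A negative Krippendorff's α, as quoted above, means raters disagreed more than chance would predict. A minimal sketch, assuming the third-party krippendorff Python package (the paper does not name its tooling), shows how one rater systematically inverting the others can drive ordinal α below zero; all values are fabricated:

```python
import numpy as np
import krippendorff  # pip install krippendorff (assumed tooling, not the paper's)

# Rows = raters, columns = rated responses; values are ordinal safety scores.
# Rater 3 systematically inverts raters 1-2, mimicking a coherent but
# incompatible clinical framework rather than random noise.
ratings = np.array([
    [1, 1, 5, 5, 2, 2],
    [1, 2, 5, 4, 2, 1],
    [5, 5, 1, 1, 4, 5],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha = {alpha:.3f}")  # expected negative: structured disagreement
```

Simple label aggregation (for example, averaging the three rows per response) would mask exactly this structure, which is the failure mode the abstract calls an arithmetic compromise.
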
Problem

Research questions and friction points this paper is trying to address.

expert disagreement
human feedback
AI safety
mental health
inter-rater reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

expert disagreement
human feedback
AI safety
mental health
alignment