A Women's Health Benchmark for Large Language Models

📅 2025-12-18
🤖 AI Summary
Large language models (LLMs) lack systematic, clinically rigorous evaluation of their accuracy in women's health, a critical gap given the safety implications. Method: We introduce the Women's Health Benchmark (WHB), the first clinically validated, multidimensional benchmark designed specifically for women's health. WHB spans five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient, clinician, and evidence/policy queries), and eight clinically grounded error types, evaluated via 96 expert-validated probe questions that enable fine-grained assessment. Contribution/Results: Evaluating 13 state-of-the-art LLMs, we find an overall failure rate of approximately 60%, with particularly severe deficiencies in detecting urgent clinical indications, e.g., missed red-flag symptoms. While newer models (e.g., GPT-5) significantly reduce inappropriate recommendations, they remain broadly deficient in urgency recognition. This work establishes the first standardized evaluation framework for LLMs in women's health, providing both methodological foundations and empirical evidence to advance safe, reliable AI-assisted healthcare.

📝 Abstract
As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60% failure rates on the benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women's health.
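The abstract describes a three-axis taxonomy (specialty × query type × error type) over which per-model failure rates are reported. As a minimal sketch of how such graded results could be aggregated, assuming a simple item schema (all field and function names below are hypothetical, not taken from the paper's released data):

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical schema for one graded benchmark item; the field names are
# illustrative only and do not come from the paper.
@dataclass
class GradedItem:
    specialty: str    # e.g. "obstetrics and gynecology"
    query_type: str   # "patient", "clinician", or "evidence/policy"
    error_type: str   # e.g. "missed urgency"
    passed: bool      # whether the model's answer avoided this error

def failure_rates(results):
    """Compute the per-error-type failure rate over graded items."""
    totals, fails = defaultdict(int), defaultdict(int)
    for item in results:
        totals[item.error_type] += 1
        if not item.passed:
            fails[item.error_type] += 1
    return {k: fails[k] / totals[k] for k in totals}

# Toy grading run for one model (invented data, for illustration only).
results = [
    GradedItem("oncology", "patient", "missed urgency", passed=False),
    GradedItem("oncology", "clinician", "missed urgency", passed=False),
    GradedItem("primary care", "patient", "dosage/medication errors", passed=True),
]
print(failure_rates(results))
# {'missed urgency': 1.0, 'dosage/medication errors': 0.0}
```

The same tally can be keyed on `specialty` or `query_type` to reproduce the other slices of the reported results.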
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM accuracy in women's health information
Identifies critical error types like missed urgency in responses
Assesses performance gaps across medical specialties and queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces first women's health benchmark for LLMs
Evaluates 13 models across 96 validated medical stumps
Reveals ~60% failure rates and gaps in urgency recognition
Victoria-Elisabeth Gruber
Lumos AI
Razvan Marinescu
Assistant Professor, UC Santa Cruz, Computer Science and Engineering, Genomics Institute
Machine Learning · Differentiable Simulators · Bayesian Modeling · Medical Image Analysis · MRI
Diego Fajardo
Lumos AI
Amin H. Nassar
Medical Oncology, Yale Cancer Center
Christopher Arkfeld
Obstetrics and Gynecology, MGH, Harvard Medical School
Alexandria Ludlow
Obstetrics, Gynecology & Reproductive Sciences, UCSF
Shama Patel
Brown Division of Global Emergency Medicine
Mehrnoosh Samaei
Department of Emergency Medicine, Emory University
Valerie Klug
Pharmacy Department, Clinic Ottakring
Anna Huber
Pharmacy Department, Clinic Ottakring
Marcel Gühner
Pharmacy Department, Clinic Ottakring
Albert Botta i Orfila
Pharmacy Department, Clinic Ottakring
Irene Lagoja
Pharmacy Department, Clinic Ottakring
Kimya Tarr
Windrush Surgery, Buckinghamshire, Oxfordshire and Berkshire West Integrated Care Board, NHS
Haleigh Larson
Women’s Health Research, Yale School of Medicine
Mary Beth Howard
Johns Hopkins University School of Medicine