🤖 AI Summary
Large language models (LLMs) lack systematic, clinically rigorous evaluation of their accuracy in women's health, a critical gap given the safety implications. Method: We introduce the Women's Health Benchmark (WHB), the first clinically validated, multidimensional benchmark specifically designed for women's health. WHB spans five medical specialties, three query types (patient, clinician, and evidence/policy queries), and eight clinically grounded error types, evaluated via 96 expert-validated "model probe questions" that enable fine-grained assessment. Contribution/Results: Evaluating 13 state-of-the-art LLMs, we find an overall failure rate of approximately 60%, with particularly severe deficiencies in detecting urgent clinical indications (e.g., missed red-flag symptoms). While newer models (e.g., GPT-5) significantly reduce inappropriate recommendations, they remain broadly deficient in urgency recognition. This work establishes the first standardized evaluation framework for LLMs in women's health, providing both methodological foundations and empirical evidence to advance safe, reliable AI-assisted healthcare.
📝 Abstract
As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model probe questions covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). Evaluating 13 state-of-the-art LLMs, we reveal alarming gaps: current models fail on approximately 60% of benchmark cases, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models such as GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet capable of providing reliable advice in women's health.