🤖 AI Summary
Large language models (LLMs) lack systematic, clinically rigorous evaluation of their accuracy in women's health, a critical gap given the safety implications. Method: We introduce the Women's Health Benchmark (WHB), the first clinically validated, multidimensional benchmark specifically designed for women's health. WHB spans five medical specialties, three query types (patient, clinician, and evidence/policy queries), and eight clinically grounded error types, evaluated via 96 expert-validated "model probe questions" that enable fine-grained assessment. Contribution/Results: Evaluating 13 state-of-the-art LLMs, we find an overall failure rate of approximately 60%, with particularly severe deficiencies in detecting urgent clinical indications (e.g., missed red-flag symptoms). While newer models (e.g., GPT-5) significantly reduce inappropriate recommendations, they remain broadly deficient in urgency recognition. This work establishes the first standardized evaluation framework for LLMs in women's health, providing both methodological foundations and empirical evidence to advance safe, reliable AI-assisted healthcare.
📝 Abstract
As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model probe questions covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). Evaluating 13 state-of-the-art LLMs, we reveal alarming gaps: current models fail on approximately 60% of benchmark cases, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models such as GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet capable of providing reliable advice in women's health.