Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology

📅 2025-03-31
🤖 AI Summary
This study evaluates the feasibility and accuracy of large language models (LLMs) for automated history-taking in obstetrics and gynecology, focusing on infertility, a sensitive and clinically complex domain. Method: A dual conversational agent system built on ChatGPT-4o and ChatGPT-4o-mini simulated physician-patient interactions for 70 real-world infertility cases, generating 420 diagnostic histories. Performance was assessed using F1 score, differential diagnosis (DDs) accuracy, infertility type judgment (ITJ) accuracy, and consistency (Cronbach's α). Contribution/Results: The lighter 4o-mini outperformed 4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029; p = 0.045, d = 0.244) and in history completeness (97.58% vs. 77.11%, a gap of 20.47 percentage points). ChatGPT-4o scored slightly higher on differential diagnosis (2.0524 vs. 2.0048, p > 0.05), while 4o-mini had higher ITJ accuracy (0.6476 vs. 0.5905) but lower consistency (Cronbach's α = 0.562). These findings suggest that a larger model does not necessarily perform better at history-taking. Both models showed strong feasibility for automating infertility history-taking; the authors call for expert validation in clinical settings, model fine-tuning, and larger, more varied datasets.
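The conversational setup described in the summary can be pictured as a loop in which one agent asks questions and the other answers from a case record. The following is a minimal sketch only: the function name `simulate_interview`, the `[END]` stop marker, and the rule-based `toy_doctor` and `toy_patient` stand-ins are all hypothetical, and the paper's actual prompts and model calls are not reproduced here.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def simulate_interview(
    doctor: Callable[[List[Turn]], str],
    patient: Callable[[List[Turn]], str],
    max_turns: int = 20,
    stop_token: str = "[END]",
) -> List[Turn]:
    """Alternate doctor questions and patient answers until the doctor stops."""
    transcript: List[Turn] = []
    for _ in range(max_turns):
        question = doctor(transcript)
        if stop_token in question:
            break
        transcript.append(("doctor", question))
        transcript.append(("patient", patient(transcript)))
    return transcript


# Hypothetical rule-based stand-ins for the two LLM-backed agents.
QUESTIONS = [
    "How long have you been trying to conceive?",
    "Have you been pregnant before?",
]
CASE = {"trying": "About two years.", "pregnant": "No, never."}


def toy_doctor(transcript: List[Turn]) -> str:
    # Ask the next scripted question, or signal the end of the interview.
    asked = sum(1 for speaker, _ in transcript if speaker == "doctor")
    return QUESTIONS[asked] if asked < len(QUESTIONS) else "[END]"


def toy_patient(transcript: List[Turn]) -> str:
    # Answer the most recent question from the case record by keyword match.
    last_question = transcript[-1][1]
    for keyword, answer in CASE.items():
        if keyword in last_question:
            return answer
    return "I am not sure."


transcript = simulate_interview(toy_doctor, toy_patient)
```

In a real system each stand-in would wrap a model call, and the resulting transcript would then be condensed into a structured history for scoring.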

📝 Abstract
Effective physician-patient communication in pre-diagnostic settings, especially in complex and sensitive medical areas such as infertility, is critical but time-consuming, which makes clinic workflows inefficient. Recent advances in Large Language Models (LLMs) offer a potential solution for automating conversational medical history-taking and improving diagnostic accuracy. This study evaluates the feasibility and performance of LLMs on these tasks for infertility cases. An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini. A total of 70 real-world infertility cases were processed, generating 420 diagnostic histories. Model performance was assessed using F1 score, Differential Diagnosis (DDs) Accuracy, and Accuracy of Infertility Type Judgment (ITJ). ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and demonstrated higher completeness in medical history-taking (97.58% vs. 77.11%), suggesting that ChatGPT-4o-mini is more effective in extracting detailed patient information, which is critical for improving diagnostic accuracy. In contrast, ChatGPT-4o performed slightly better in differential diagnosis accuracy (2.0524 vs. 2.0048, p > 0.05). ITJ accuracy was higher in ChatGPT-4o-mini (0.6476 vs. 0.5905) but with lower consistency (Cronbach's $\alpha$ = 0.562), suggesting variability in classification reliability. Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy. Future studies should prioritize expert validation of accuracy and dependability in clinical settings, AI model fine-tuning, and larger datasets covering a mix of infertility cases.
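For readers unfamiliar with the consistency statistic quoted above, Cronbach's $\alpha$ for $k$ items is $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_T^2}\right)$, where $\sigma_i^2$ is the variance of item $i$ and $\sigma_T^2$ is the variance of the per-row totals. The sketch below computes it with the Python standard library. What counts as an "item" here (for example, repeated runs on the same case) is an assumption made for illustration; the abstract does not say how the authors set this up, and the data are invented.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows (one per case), each row holding k item scores.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Invented toy data: 1 = correct ITJ classification, 0 = incorrect.
# Two runs that always agree give alpha = 1.
consistent = [[1, 1], [0, 0], [1, 1], [0, 0]]
# Two runs that agree only half the time give alpha = 0 here.
inconsistent = [[1, 0], [0, 1], [1, 1], [0, 0]]

alpha_consistent = cronbach_alpha(consistent)
alpha_inconsistent = cronbach_alpha(inconsistent)
```

On this scale, the reported value of 0.562 sits between these two extremes, which is in line with the abstract's reading of variable classification reliability.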
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for medical history-taking in obstetrics and gynecology
Assessing feasibility and accuracy of AI in infertility case diagnostics
Comparing ChatGPT-4o and ChatGPT-4o-mini for diagnostic performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-driven system simulates doctor-patient conversations
ChatGPT-4o-mini excels in medical history extraction
Models assessed via F1 score and diagnosis accuracy
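As a rough illustration of the extraction metrics named in the list above, the sketch below scores a set of extracted history items against a reference set. It assumes that items are compared by exact match and that completeness is the share of required fields that were filled; both are assumptions for illustration, not the paper's documented scoring protocol, and the field names and values are invented.

```python
def extraction_scores(extracted: set, reference: set) -> dict:
    """Set-based precision, recall, and F1 for extracted history items."""
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def completeness(filled_fields: set, required_fields: set) -> float:
    """Share of required history fields that the model filled in."""
    return len(filled_fields & required_fields) / len(required_fields)


# Invented example: one item is missed and one spurious item is added.
reference = {"age:32", "duration:2y", "prior_pregnancy:no", "cycle:regular"}
extracted = {"age:32", "duration:2y", "prior_pregnancy:no", "bmi:22"}
scores = extraction_scores(extracted, reference)

required = {"age", "duration", "prior_pregnancy", "cycle"}
filled = {"age", "duration", "prior_pregnancy", "bmi"}
coverage = completeness(filled, required)
```

Here three of four extracted items are correct and three of four reference items are recovered, so precision, recall, and F1 all equal 0.75, and coverage is also 0.75.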
Dou Liu
Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, US
Ying Long
Center for Reproductive Medicine, Department of Gynecology and Obstetrics, West China Second University Hospital, Sichuan University, Chengdu, China
Sophia Zuoqiu
Department of Industrial Engineering, Sichuan University, Chengdu, China
Tian Tang
University of Alberta
Rong Yin
Associate Researcher, Institute of Information Engineering, Chinese Academy of Sciences
LLM, Graph Representation Learning, Statistical Learning Theory