Reliability of Large Language Model Generated Clinical Reasoning in Assisted Reproductive Technology: Blinded Comparative Evaluation Study

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the reliability of large language models (LLMs) in generating clinical reasoning chains for assisted reproductive technology (ART). To address evaluation challenges, we propose a “dual-principle” framework—“gold-standard depth” and “representative diversity”—revealing that example quality, not quantity, predominantly governs reasoning fidelity, and underscoring the irreplaceable role of blinded clinical expert evaluation in high-stakes medical AI assessment. Through controlled experiments comparing zero-shot, random few-shot, and selective few-shot prompting—evaluated via both blinded clinician review and GPT-4o automated assessment—we find that selective few-shot prompting significantly improves reasoning credibility (p < .001). Key contributions include: (1) establishing high-quality, clinically diverse exemplars as the primary gain factor for robust reasoning; (2) empirically demonstrating limitations of AI-based evaluators on critical clinical dimensions (e.g., safety, guideline adherence); and (3) introducing a scalable paradigm for constructing trustworthy medical reasoning datasets.

📝 Abstract
Creating high-quality clinical Chains-of-Thought (CoTs) is crucial for explainable medical Artificial Intelligence (AI) while constrained by data scarcity. Although Large Language Models (LLMs) can synthesize medical data, their clinical reliability remains unverified. This study evaluates the reliability of LLM-generated CoTs and investigates prompting strategies to enhance their quality. In a blinded comparative study, senior clinicians in Assisted Reproductive Technology (ART) evaluated CoTs generated via three distinct strategies: Zero-shot, Random Few-shot (using shallow examples), and Selective Few-shot (using diverse, high-quality examples). These expert ratings were compared against evaluations from a state-of-the-art AI model (GPT-4o). The Selective Few-shot strategy significantly outperformed other strategies across all human evaluation metrics (p < .001). Critically, the Random Few-shot strategy offered no significant improvement over the Zero-shot baseline, demonstrating that low-quality examples are as ineffective as no examples. The success of the Selective strategy is attributed to two principles: "Gold-Standard Depth" (reasoning quality) and "Representative Diversity" (generalization). Notably, the AI evaluator failed to discern these critical performance differences. The clinical reliability of synthetic CoTs is dictated by strategic prompt curation, not the mere presence of examples. We propose a "Dual Principles" framework as a foundational methodology to generate trustworthy data at scale. This work offers a validated solution to the data bottleneck and confirms the indispensable role of human expertise in evaluating high-stakes clinical AI.
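The abstract contrasts three prompting strategies: Zero-shot, Random Few-shot, and Selective Few-shot built on "Gold-Standard Depth" and "Representative Diversity". A minimal sketch of how such prompts might be assembled is shown below; the exemplar pool, field names, and selection heuristic are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical exemplar pool: each entry holds a reasoning chain (CoT),
# a clinician-assigned quality score, and a clinical scenario tag.
# All entries here are placeholders, not real clinical content.
EXEMPLARS = [
    {"cot": "Case A reasoning...", "quality": 5, "scenario": "poor ovarian response"},
    {"cot": "Case B reasoning...", "quality": 2, "scenario": "poor ovarian response"},
    {"cot": "Case C reasoning...", "quality": 5, "scenario": "recurrent implantation failure"},
    {"cot": "Case D reasoning...", "quality": 3, "scenario": "male factor infertility"},
    {"cot": "Case E reasoning...", "quality": 5, "scenario": "male factor infertility"},
]

def build_prompt(question, examples):
    # Concatenate exemplars (if any) followed by the target question.
    parts = [f"Example:\n{e['cot']}" for e in examples]
    parts.append(f"Question:\n{question}\n"
                 "Provide a step-by-step clinical reasoning chain.")
    return "\n\n".join(parts)

def zero_shot(question):
    # No exemplars at all.
    return build_prompt(question, [])

def random_few_shot(question, k=3, seed=0):
    # Exemplars drawn without regard to quality or coverage.
    rng = random.Random(seed)
    return build_prompt(question, rng.sample(EXEMPLARS, k))

def selective_few_shot(question, k=3):
    # "Gold-Standard Depth": keep only top-rated exemplars.
    best = [e for e in EXEMPLARS if e["quality"] == 5]
    # "Representative Diversity": at most one exemplar per scenario.
    seen, chosen = set(), []
    for e in best:
        if e["scenario"] not in seen:
            seen.add(e["scenario"])
            chosen.append(e)
    return build_prompt(question, chosen[:k])
```

The random strategy can pull shallow (low-quality) or redundant exemplars, while the selective strategy enforces both principles before the prompt is built, which mirrors the paper's explanation of why example quality rather than quantity drives reasoning fidelity.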
Problem

Research questions and friction points this paper is trying to address.

Evaluating reliability of LLM-generated clinical reasoning in reproductive medicine
Investigating prompting strategies to enhance medical AI reasoning quality
Addressing data scarcity for trustworthy clinical AI through strategic curation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Few-shot prompting enhances clinical reasoning quality
Dual Principles framework ensures reasoning depth and diversity
Human expertise validates AI-generated clinical chains-of-thought
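The reported p < .001 comparison of clinician ratings across strategies could be reproduced in spirit with a simple nonparametric test. The sketch below uses a stdlib-only permutation test on mean ratings; the rating vectors are invented for illustration and do not come from the study's data.

```python
import random
from statistics import mean

def permutation_test(a, b, n=10000, seed=0):
    """Two-sided permutation test on the difference in mean ratings."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    hits = 0
    for _ in range(n):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return hits / n

# Illustrative blinded clinician ratings (1-5 scale) per strategy;
# these numbers are made up for the sketch.
zero_shot_ratings = [3, 2, 3, 3, 2, 3, 2, 3]
random_fs_ratings = [3, 3, 2, 3, 3, 2, 3, 3]
select_fs_ratings = [5, 4, 5, 5, 4, 5, 4, 5]

p_sel = permutation_test(select_fs_ratings, zero_shot_ratings)
p_rnd = permutation_test(random_fs_ratings, zero_shot_ratings)
```

On data shaped like this, `p_sel` is near zero while `p_rnd` is large, matching the paper's pattern: selective few-shot separates clearly from the zero-shot baseline, whereas random few-shot does not.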
Dou Liu
Department of Obstetrics and Gynecology, West China Second University Hospital, Sichuan University, Chengdu, China
Ying Long
Department of Obstetrics and Gynecology, West China Second University Hospital, Sichuan University, Chengdu, China
Sophia Zuoqiu
Department of Industrial Engineering, Sichuan University, Chengdu, China
Di Liu
Department of Industrial Engineering, Sichuan University, Chengdu, China
Kang Li
West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
Yiting Lin
West China School of Medicine, Sichuan University, Chengdu, China
Hanyi Liu
West China School of Medicine, Sichuan University, Chengdu, China