🤖 AI Summary
This study investigates whether large language models (LLMs) are clinically competent for real-world pediatric practice. Method: We introduce PEDIASBench, the first systematic, pediatric-specific evaluation framework, comprising 19 pediatric subspecialties and 211 prototypical diseases and assessing LLMs across three dimensions: foundational knowledge application, dynamic clinical reasoning, and medical safety and ethics. It innovatively incorporates dynamic case-based reasoning, humanistic care assessment, and multi-level ethical safety testing, using multiple-choice questions, structured case analyses, and contextual judgment tasks. Contribution/Results: A comprehensive evaluation of 12 state-of-the-art LLMs reveals that top-performing models achieve >90% accuracy on foundational examinations but exhibit roughly 15% performance degradation on tasks requiring complex reasoning, real-time decision-making, and ethical sensitivity. These findings indicate that current LLMs are not yet suitable for autonomous clinical practice in pediatrics; however, they show strong potential as clinical decision-support tools and educational aids in medical training.
📝 Abstract
With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework grounded in a structured knowledge system and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined by roughly 15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments further highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time changes in patient status. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (92.05% accuracy), though humanistic sensitivity remained limited. These findings indicate that pediatric applications of LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a closed loop of clinical feedback and model iteration to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently deliver pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.