🤖 AI Summary
This study investigates whether large language models (LLMs) are clinically competent for real-world pediatric practice. Method: We introduce PEDIASBench, the first systematic, pediatric-specific evaluation framework, comprising 19 pediatric subspecialties and 211 prototypical diseases and assessing LLMs across three dimensions: foundational knowledge application, dynamic clinical reasoning, and medical safety and ethics. It innovatively incorporates dynamic case-based reasoning, humanistic care assessment, and multi-level ethical safety testing, using multiple-choice questions, structured case analyses, and contextual judgment tasks. Contribution/Results: A comprehensive evaluation of 12 state-of-the-art LLMs reveals that top-performing models achieve >90% accuracy on foundational examinations but exhibit roughly 15% performance degradation on tasks requiring complex reasoning, real-time decision-making, and ethical sensitivity. These findings indicate that current LLMs are not yet suitable for autonomous clinical practice in pediatrics; however, they show strong potential as clinical decision-support tools and educational aids in medical training.
📝 Abstract
With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework grounded in a structured knowledge system and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined by roughly 15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments further highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time changes in patient status. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (92.05% accuracy), though humanistic sensitivity remained limited. These findings indicate that pediatric applications of LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a closed loop of clinical feedback and model iteration to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently deliver pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.