PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
A standardized benchmark for evaluating the question-answering (QA) capabilities of large language models (LLMs) in the Chinese pediatric domain is currently lacking, particularly one that covers both objective and subjective question types, multiple disease groups, and clinical reasoning. Method: The authors introduce PediaBench, the first comprehensive Chinese pediatric benchmark for LLM evaluation, comprising 5,749 QA items (4,117 objective and 1,632 subjective questions) spanning 12 common pediatric disease groups. It adopts an integrated, difficulty-based scoring criterion that jointly assesses instruction following, medical knowledge understanding, and clinical case analysis. PediaBench is constructed from real-world clinical knowledge, annotated via human annotation with cross-validation, and evaluated through a dual-track assessment combining objective accuracy with subjective generation quality. Results: Systematic evaluation of 20 mainstream open-source and commercial LLMs reveals substantial performance gaps in pediatric QA, providing empirical evidence and an openly available benchmark to guide model improvement.

📝 Abstract
The emergence of Large Language Models (LLMs) in the medical domain has created a pressing need for standard datasets to evaluate their question-answering (QA) performance. Although there have been several benchmark datasets for medical QA, they either cover common knowledge across different departments or are specific to a single department other than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capacity of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,117 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context, highlighting their limitations for further improvements. Our code and data are published at https://github.com/ACMISLab/PediaBench.
Problem

Research questions and friction points this paper is trying to address.

Lack of standard datasets for evaluating LLMs in pediatric medical QA.
Existing datasets do not comprehensively assess LLM capabilities in pediatrics.
Need for a dataset to evaluate both objective and subjective pediatric questions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

First Chinese pediatric QA dataset for LLM evaluation
Combines 4,117 objective and 1,632 subjective questions across 12 disease groups
Integrated difficulty-based scoring of instruction following, knowledge understanding, and clinical case analysis
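The integrated, difficulty-based scoring described above can be sketched as a weighted aggregate over items. This is a minimal illustration only: the weight values, field names, and normalization are assumptions, not PediaBench's published criterion.

```python
# Minimal sketch of a difficulty-weighted, dual-track score.
# Weights and item fields are illustrative assumptions, not
# PediaBench's actual scoring criterion.

def integrated_score(items, difficulty_weights):
    """Aggregate per-item scores, weighting each by its difficulty level.

    items: list of dicts with keys 'difficulty' (e.g. 'easy'/'medium'/'hard')
           and 'score' (objective accuracy in {0, 1}, or a subjective
           generation-quality rating normalized to [0, 1]).
    Returns the weighted score as a fraction of the maximum attainable.
    """
    total = sum(difficulty_weights[it["difficulty"]] * it["score"] for it in items)
    max_total = sum(difficulty_weights[it["difficulty"]] for it in items)
    return total / max_total if max_total else 0.0

weights = {"easy": 1.0, "medium": 2.0, "hard": 3.0}  # assumed weights
results = [
    {"difficulty": "easy", "score": 1.0},  # objective item answered correctly
    {"difficulty": "hard", "score": 0.5},  # subjective item rated 0.5
]
print(round(integrated_score(results, weights), 3))  # → 0.625
```

Weighting harder items more heavily means a model cannot inflate its score by excelling only on easy objective questions, which mirrors the benchmark's goal of probing clinical reasoning rather than recall alone.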
👥 Authors
Qian Zhang
State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China; Guizhou Engineering Laboratory for Advanced Computing and Medical Information Services, Guiyang, China
Panfeng Chen
State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China; Guizhou Engineering Laboratory for Advanced Computing and Medical Information Services, Guiyang, China
Jiali Li
State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China; Guizhou Engineering Laboratory for Advanced Computing and Medical Information Services, Guiyang, China
Linkun Feng
State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China; Guizhou Engineering Laboratory for Advanced Computing and Medical Information Services, Guiyang, China
Shuyu Liu
Heng Zhao
Mei Chen
State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China; Guizhou Engineering Laboratory for Advanced Computing and Medical Information Services, Guiyang, China
Hui Li
State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, China; Guizhou Engineering Laboratory for Advanced Computing and Medical Information Services, Guiyang, China
Yanhao Wang
School of Data Science and Engineering, East China Normal University, Shanghai, China