GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

📅 2026-05-23
🤖 AI Summary
This study addresses the lack of systematic evaluation of reasoning robustness and safety in large language models (LLMs) for dental clinical reasoning. The authors introduce GlobalDentBench, the first multinational dental benchmark, covering 88 countries and regions across six continents with 8,978 expert-validated questions spanning 14 specialties. The benchmark defines three progressive reasoning levels, knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3), and uses three question formats: multiple-choice, short-answer, and case-based. Its automated construction framework was calibrated by six senior dentists, reaching expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for case-based questions. Evaluation of 12 frontier LLMs shows accuracy falling from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, and from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. In a risk analysis of real-world dental cases, 31.01% of LLM-generated clinical recommendations were unsafe, including 4.51% that risked irreversible patient harm, pointing to significant gaps in current models' readiness for real-world clinical deployment.
📝 Abstract
While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.
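To make the "stepwise" degradation concrete, the following minimal sketch computes the percentage-point drop between consecutive question formats and reasoning levels. It uses only the accuracy figures reported in the abstract; the names `BY_FORMAT`, `BY_LEVEL`, and `stepwise_drops` are illustrative and not part of GlobalDentBench.

```python
# Accuracy figures (percent) exactly as reported in the abstract.
# The dictionary and function names here are hypothetical, for illustration only.
BY_FORMAT = {"multiple-choice": 81.34, "short-answer": 64.53, "case-based": 22.34}
BY_LEVEL = {"L1": 74.01, "L2": 55.64, "L3": 35.71}


def stepwise_drops(scores):
    """Return (from, to, drop) for each consecutive pair, in percentage points."""
    items = list(scores.items())  # dicts preserve insertion order
    return [
        (name_a, name_b, round(acc_a - acc_b, 2))
        for (name_a, acc_a), (name_b, acc_b) in zip(items, items[1:])
    ]


format_drops = stepwise_drops(BY_FORMAT)
level_drops = stepwise_drops(BY_LEVEL)

for name_a, name_b, drop in format_drops + level_drops:
    print(f"{name_a} -> {name_b}: -{drop:.2f} points")
```

Under these figures, the largest single step is from short-answer to case-based questions, while the two steps between reasoning levels are of similar size.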
Problem

Research questions and friction points this paper is trying to address.

large language models
clinical reasoning
dental AI safety
multinational benchmark
LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

GlobalDentBench
clinical reasoning evaluation
expert-calibrated benchmark
LLM safety in dentistry
multinational dental dataset
Junjie Zhao
Master's student, Peking University
CVML
Jingyi Liang
School of Data Science, The Chinese University of Hong Kong, Shenzhen.
Zhenyang Cai
The Chinese University of Hong Kong, Shenzhen
Large Language Models
Jiaming Zhang
Department of Periodontology, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China.
Zhenwei Wen
Division of Applied Oral Sciences and Community Dental Care, Faculty of Dentistry, The University of Hong Kong.
Shuzhi Deng
Department of Periodontology, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China.
Wenjing Yi
Department of Orthodontics, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China.
Chunfeng Luo
Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China.
Hexian Zhang
Division of Applied Oral Sciences and Community Dental Care, Faculty of Dentistry, The University of Hong Kong.
Junying Chen
The Chinese University of Hong Kong, Shenzhen
Large Language Models
Tianrui Liu
Associate Professor at National University of Defense Technology, PhD Imperial College London
Computer Vision, Medical Image Analysis, Deep Learning
Zhuhui Bai
Department of Periodontology, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China.
Zixu Zhang
School of Data Science, The Chinese University of Hong Kong, Shenzhen.
Pradeep Singh
Professor of Mechanical Engineering, Sant Longowal Institute of Engineering & Technology, Longowal
Tolerance Design of Mechanical Assemblies, Concurrent Engineering – Design for Manufacture and Assembly, Modelling & Simulatio
Xiang Liu
College of Future Technology, Peking University.
Jianquan Li
Freedom AI.
Nhan L Tran
Professor, Cancer Biology and Neurosurgery, Mayo Clinic
Neuro-Oncology, genomics, cell biology, signal transduction, drug discovery
Falk Schwendicke
Professor, LMU Munich
AI, Cariology, Health economics, Public Health, Restorative Dentistry
Zuolin Jin
Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China.
Lijian Jin
Division of Periodontology & Implant Dentistry, Faculty of Dentistry, The University of Hong Kong, Hong Kong, SAR, China
Liangyi Chen
New Cornerstone Science Laboratory, National Biomedical Imaging Center, State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking-Tsinghua Center for Life Sciences, College of Future Technology, Peking University, Beijing 100871, China; IDG/McGovern Institute for Brain Research, Peking University, Beijing 100871, China.
Wei-fa Yang
Division of Oral and Maxillofacial Surgery, Faculty of Dentistry, The University of Hong Kong.
Benyou Wang
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
large language models, natural language processing, information retrieval, applied machine learning
Junwen Wang
Faculty of Dentistry, The University of Hong Kong
Bioinformatics, Computational Genomics, Systems Biology, Precision Dentistry, Precision Medicine
Shan Jiang
Department of Periodontology, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China.