OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of a systematic evaluation of the hierarchical cognitive capabilities of multimodal large language models (MLLMs) in dental radiographic analysis. It proposes a benchmark aligned with clinical dental reasoning that covers three imaging modalities (periapical, panoramic, and lateral cephalometric radiographs) and four cognitive levels: perception, comprehension, prediction, and decision-making. The benchmark comprises 27 clinical tasks derived from public datasets, manually curated annotations, and 3,820 clinician assessments. Experiments with six frontier MLLMs, including GPT-5.2 and GLM-4.6, quantify the performance gap relative to clinicians and characterize model limitations and failure patterns. The findings yield recommendations for developing safe, reliable AI systems for clinical dentistry.

📝 Abstract

Multimodal large language models (MLLMs) have emerged as a promising paradigm for dental image analysis. However, their ability to capture the multi-level cognitive processes required for radiographic analysis remains unclear. Here, we present a comprehensive benchmark to evaluate the cognitive capabilities of MLLMs in dental radiographic analysis. It spans three critical imaging modalities, i.e., periapical, panoramic, and lateral cephalometric radiographs, and defines four cognitive categories: perception, comprehension, prediction, and decision-making. The benchmark comprises 27 clinically grounded tasks derived from public datasets, with manually curated annotations and 3,820 clinician assessments for evaluation. Six frontier MLLMs, including GPT-5.2 and GLM-4.6, are evaluated. We demonstrate the performance gap between MLLMs and clinicians in dental practice, delineate model strengths and limitations, characterize failure patterns, and provide recommendations for improvement. This data resource will facilitate the development of next-generation artificial intelligence systems aligned with clinical cognition, safety requirements, and workflow complexity in dental practice.
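The benchmark layout described above (three modalities, four cognitive categories, model versus clinician comparison) can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' code or data format: the `Result` record, the `"model"`/`"clinician"` rater labels, and plain accuracy as the metric are all hypothetical, and the abstract does not state which metrics the paper uses.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    # The three imaging modalities named in the abstract.
    PERIAPICAL = "periapical"
    PANORAMIC = "panoramic"
    LATERAL_CEPHALOMETRIC = "lateral cephalometric"


class CognitiveLevel(Enum):
    # The four cognitive categories named in the abstract.
    PERCEPTION = "perception"
    COMPREHENSION = "comprehension"
    PREDICTION = "prediction"
    DECISION_MAKING = "decision-making"


@dataclass(frozen=True)
class Result:
    # Hypothetical record of one graded answer; not the paper's schema.
    task: str             # illustrative task name
    modality: Modality
    level: CognitiveLevel
    rater: str            # assumed labels: "model" or "clinician"
    correct: bool


def accuracy_by_level(results, rater):
    """Fraction of correct answers per cognitive level for one rater."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        if r.rater != rater:
            continue
        totals[r.level] += 1
        hits[r.level] += int(r.correct)
    return {level: hits[level] / totals[level] for level in totals}


def gap_by_level(results):
    """Clinician accuracy minus model accuracy, per shared cognitive level."""
    model = accuracy_by_level(results, "model")
    clinician = accuracy_by_level(results, "clinician")
    return {lvl: clinician[lvl] - model[lvl] for lvl in model if lvl in clinician}
```

Grouping by `CognitiveLevel` rather than by task is one way to surface where a model falls behind clinicians, for example in decision-making but not in perception. The same aggregation could be keyed on `Modality` instead.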

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

Dental Radiographic Analysis

Cognitive Capabilities

Clinical Benchmarking

Artificial Intelligence in Dentistry

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models

Dental Radiographic Analysis

Cognitive Benchmark