MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

📅 2026-04-16

🤖 AI Summary
This work addresses the lack of fine-grained annotations and comprehensive evaluation benchmarks for intraoral photography, which hinders the application of vision-language models (VLMs) in clinical dentistry. The authors propose a meta-annotation framework tailored to clinical dental images and curate a dataset of 60,669 images, of which 2,588 are annotated with a semi-structured scheme that combines an image-level summary with free-text descriptions of abnormal regions. Using large language models, they derive approximately 15,000 standardized visual question-answer pairs and an 18-category multi-label classification benchmark, both checked through human review and error analysis. The resulting evaluation suite is presented as the first comprehensive one for dental VLMs, and experiments on it show that state-of-the-art models still have significant limitations in fine-grained visual understanding.

📝 Abstract
Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.
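
The labeling scheme described above (an image-level summary plus point-by-point free-text findings, later mapped to an 18-class multi-label target) can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the `MetaAnnotation` fields, the placeholder class names in `LABELS`, and the use of micro-averaged F1 as a score are not taken from the paper, which does not list its class names or metrics in this abstract.

```python
from dataclasses import dataclass, field

# Hypothetical vocabulary: the abstract states there are 18 classes but does
# not name them, so placeholder names are used.
LABELS = [f"class_{i:02d}" for i in range(18)]


@dataclass
class MetaAnnotation:
    """Illustrative record for one annotated image (field names are assumed)."""
    image_id: str
    summary: str                                       # high-level image summary
    findings: list[str] = field(default_factory=list)  # one free-text entry per abnormality


def to_multi_hot(assigned: set[str], vocabulary: list[str] = LABELS) -> list[int]:
    """Encode a set of class names as a 0/1 vector ordered like `vocabulary`."""
    unknown = assigned - set(vocabulary)
    if unknown:
        raise ValueError(f"labels not in vocabulary: {sorted(unknown)}")
    return [1 if name in assigned else 0 for name in vocabulary]


def micro_f1(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    """Micro-averaged F1 over all (image, class) decisions."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        for t, p in zip(truth, pred):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Example record; the text content is invented purely for illustration.
record = MetaAnnotation(
    image_id="img_0001",
    summary="Frontal intraoral view.",
    findings=["Finding A on an upper tooth.", "Finding B near the gingival margin."],
)
target = to_multi_hot({"class_00", "class_17"})
```

Keeping the free-text findings separate from the derived label vector mirrors the task-agnostic intent of the scheme: the same record could feed a VQA generator, a classifier target, or a captioning reference without re-annotating the image.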
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Intraoral Photography
Fine-grained Annotation
Dental Image Analysis
Benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Meta-labeling
Dental Image Annotation
Visual Question Answering
Large Language Models
Meng-Xun Li
Department of Prosthodontics, State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine, Ministry of Education, Hubei Key Laboratory of Stomatology, School and Hospital of Stomatology, Wuhan University, Wuhan, Hubei, China; School of Artificial Intelligence, Wuhan University, Wuhan, Hubei, China
Wen-Hui Deng
School of Computer Science, Wuhan University, Wuhan, Hubei, China
Zhi-Xing Wu
Department of Prosthodontics, State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine, Ministry of Education, Hubei Key Laboratory of Stomatology, School and Hospital of Stomatology, Wuhan University, Wuhan, Hubei, China
Chun-Xiao Jin
Department of Prosthodontics, State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine, Ministry of Education, Hubei Key Laboratory of Stomatology, School and Hospital of Stomatology, Wuhan University, Wuhan, Hubei, China
Jia-Min Wu
Dental Materials Science, Clinical Artificial Intelligence, Faculty of Dentistry, The University of Hong Kong, Hong Kong SAR, China
Yue Han
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA
James Kit Hon Tsoi
Dental Materials Science, Clinical Artificial Intelligence, Faculty of Dentistry, The University of Hong Kong, Hong Kong SAR, China
Gui-Song Xia
School of Artificial Intelligence, Wuhan University, China
Artificial Intelligence, Computer Vision, Photogrammetry, Remote Sensing, Robotics
Cui Huang
Department of Prosthodontics, State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine, Ministry of Education, Hubei Key Laboratory of Stomatology, School and Hospital of Stomatology, Wuhan University, Wuhan, Hubei, China