IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

📅 2026-03-17

🤖 AI Summary

This work addresses a limitation of existing dental vision-language models: they do not fully exploit the native 3D geometry of intraoral scans (IOS), which hinders unified multi-disease diagnosis. To overcome this, we propose IOSVLM, an end-to-end 3D vision-language model that represents IOS as point clouds and combines a 3D encoder, a projector, and a large language model for unified diagnosis and generative visual question answering grounded in 3D geometry. To bridge the distribution gap between color-free IOS data and color-dependent 3D pre-training, we design a geometry-to-chromatic proxy and adopt a two-stage curriculum training strategy to improve robustness. We also introduce IOSVQA, a large-scale, multi-source VQA dataset for IOS-based diagnosis comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases. In experiments, IOSVLM outperforms strong baselines by at least 9.58% in macro accuracy and 1.46% in macro F1, supporting the value of modeling 3D geometry directly.
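The reported gains are macro-averaged, meaning each disease counts equally regardless of how often it occurs. The following is a minimal sketch of how such scores could be computed for multi-label diagnosis. It assumes macro accuracy means per-disease binary accuracy averaged over diseases; the summary does not state the exact definition used, and the function name `macro_metrics` is hypothetical.

```python
from typing import Dict, List, Sequence


def macro_metrics(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
) -> Dict[str, float]:
    """Macro accuracy and macro F1 for multi-label 0/1 predictions.

    Rows are cases, columns are diseases. Assumption: macro accuracy is the
    unweighted mean of per-disease binary accuracy.
    """
    n_classes = len(y_true[0])
    accuracies: List[float] = []
    f1_scores: List[float] = []
    for c in range(n_classes):
        tp = fp = fn = tn = 0
        for true_row, pred_row in zip(y_true, y_pred):
            t, p = true_row[c], pred_row[c]
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
        accuracies.append((tp + tn) / (tp + tn + fp + fn))
        denom = 2 * tp + fp + fn
        # A disease with no positives and no predicted positives scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return {
        "macro_accuracy": sum(accuracies) / n_classes,
        "macro_f1": sum(f1_scores) / n_classes,
    }


if __name__ == "__main__":
    truth = [[1, 0], [0, 0], [1, 1], [0, 0]]
    preds = [[1, 0], [0, 0], [0, 1], [0, 1]]
    print(macro_metrics(truth, preds))
```

Because every disease contributes equally, a rare disease that is always missed lowers macro F1 as much as a common one would, which is why macro averaging suits class-imbalanced settings.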

📝 Abstract

3D intraoral scans (IOS) are increasingly adopted in routine dentistry because they provide abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. Recent works introduce dental vision-language models (VLMs) for unified diagnosis and report generation on 2D images or on multi-view images rendered from IOS, but they do not fully leverage native 3D geometry. Modeling that geometry directly is necessary but challenging because of (i) heterogeneous scan forms and complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, and (iii) limited paired 3D IOS-text data. We therefore present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.
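To make the idea of a geometry-to-chromatic proxy concrete, here is an illustrative sketch that fills the missing color channels of a point cloud with values derived from geometry. The abstract does not describe how the proxy is computed, so mapping unit surface normals to RGB is purely an assumed stand-in, and the names `normal_to_rgb` and `attach_proxy_colors` are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def normal_to_rgb(normal: Vec3) -> Vec3:
    """Map a surface normal to an RGB triple in [0, 1].

    Assumption: this normal-based coloring is an illustrative stand-in,
    not the proxy defined in the paper.
    """
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        # Degenerate normal: fall back to neutral gray.
        return (0.5, 0.5, 0.5)
    r, g, b = (0.5 * (c / length) + 0.5 for c in normal)
    return (r, g, b)


def attach_proxy_colors(
    points: List[Vec3], normals: List[Vec3]
) -> List[Tuple[float, ...]]:
    """Return xyz+rgb points so a color-expecting 3D encoder gets six channels."""
    if len(points) != len(normals):
        raise ValueError("points and normals must have the same length")
    return [p + normal_to_rgb(n) for p, n in zip(points, normals)]


if __name__ == "__main__":
    pts = [(1.0, 2.0, 3.0), (0.0, 0.0, 0.0)]
    nrm = [(0.0, 0.0, -1.0), (0.0, 0.0, 2.0)]
    for row in attach_proxy_colors(pts, nrm):
        print(row)
```

The point of such a mapping is that a 3D encoder pre-trained on colored point clouds receives inputs in the channel layout it expects, with the color channels carrying geometric cues instead of constant or empty values.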

Problem

Research questions and friction points this paper is trying to address.

3D intraoral scans

unified dental diagnosis

vision-language model

multi-disease co-occurrence

3D geometry

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Vision-Language Model

Intraoral Scans

Point Cloud Representation