GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

📅 2024-11-21
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
General-purpose medical AI systems face limitations in clinical deployment due to insufficient domain-specific knowledge and inadequate multimodal data processing capabilities. To address this, we introduce GMAI-VL-5.5M—the first large-scale, specialized medical vision-language dataset comprising 5.5 million high-quality image–text pairs—and propose a novel three-stage vision-language joint training paradigm: (1) visual encoder initialization, (2) image–text alignment fine-tuning, and (3) task-aware refinement, enabling deep integration of ViT-based encoders with large language models. Our methodological innovations include: (1) multi-source medical data cleaning and structured pairing augmentation; (2) a clinical-scenario-oriented progressive pretraining strategy; and (3) a unified cross-task, cross-modal modeling framework. Extensive evaluations demonstrate state-of-the-art performance across medical visual question answering, lesion detection, and radiology report generation—outperforming CLIP, Med-PaLM, and PMC-VL—with superior generalizability and clinically interpretable outputs.

📝 Abstract
Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we formulate GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.
Problem

Research questions and friction points this paper is trying to address.

Lack of specialized medical knowledge in general AI models
Need for comprehensive multimodal medical dataset
Improving vision-language integration for medical diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal medical dataset from specialized sources
Three-stage training for vision-language integration
State-of-the-art performance in medical tasks
Tian-Xin Li
Shanghai AI Laboratory
Yan-Cheng Su
Shanghai AI Laboratory
Wei Li
Shanghai AI Laboratory, Shanghai Jiao Tong University
Bin Fu
Shanghai AI Laboratory, Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences
Zhe Chen
Shanghai AI Laboratory, Nanjing University
Ziyan Huang
Shanghai AI Laboratory, Shanghai Jiao Tong University
Guoan Wang
Stevens Institute of Technology
General Medical AI
Chenglong Ma
Fudan University; Shanghai Innovation Institute
multi-modal models, generative models, medical image analysis
Ying Chen
Shanghai AI Laboratory, Xiamen University
Ming Hu
Shanghai AI Laboratory, Monash University
Yanjun Li
Shanghai AI Laboratory, East China Normal University
Pengcheng Chen
Shanghai AI Laboratory, University of Washington
Xiaowei Hu
Shanghai AI Laboratory
Zhongying Deng
University of Cambridge
Deep Learning, Multi-modal Learning, Computer Vision, Medical Image Analysis
Yuanfeng Ji
Stanford; HKU
Computer Vision, Medical Image Analysis
Jin Ye
Shanghai AI Laboratory, Monash University
Yu Qiao
Shanghai AI Laboratory
Junjun He
Shanghai Jiao Tong University