GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

📅 2024-11-21

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

General-purpose AI models are limited in the medical domain by a lack of specialized medical knowledge. To address this, the authors introduce GMAI-VL-5.5M, a multimodal medical dataset of 5.5 million image-text pairs, built by converting hundreds of specialized medical datasets with varied annotations into a common image-text format. Building on this dataset, they train GMAI-VL, a general medical vision-language model, with a three-stage training strategy designed to strengthen the integration of visual and textual information. The reported experiments show state-of-the-art performance across multimodal medical tasks, including visual question answering and medical image diagnosis.

📝 Abstract

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we formulate GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.
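
As a rough illustration of what "converting specialized datasets with various annotations into image-text pairs" could look like, the sketch below maps a labeled record to a question/answer pair through a fixed template. It is not the authors' pipeline: the `Annotation` schema, the `TEMPLATES` table, and the `to_image_text_pair` helper are all hypothetical names introduced here, and template filling is only one assumed way to produce such text.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical record from a source dataset (not the paper's schema)."""
    image_path: str
    modality: str  # e.g. "chest X-ray"
    task: str      # "classification" or "segmentation"
    label: str     # class name or name of the segmented structure


# Hypothetical (question, answer) templates, one pair per annotation type.
TEMPLATES = {
    "classification": (
        "What does this {modality} image show?",
        "The {modality} image shows findings consistent with {label}.",
    ),
    "segmentation": (
        "Which structure is outlined in this {modality} image?",
        "The outlined region in the {modality} image is the {label}.",
    ),
}


def to_image_text_pair(ann: Annotation) -> dict:
    """Turn one annotation into an image-text (question/answer) record."""
    question, answer = TEMPLATES[ann.task]
    fields = {"modality": ann.modality, "label": ann.label}
    return {
        "image": ann.image_path,
        "question": question.format(**fields),
        "answer": answer.format(**fields),
    }


if __name__ == "__main__":
    example = Annotation("img_001.png", "chest X-ray", "classification", "pneumonia")
    print(to_image_text_pair(example))
```

Keeping one template pair per annotation type means datasets with different label formats end up in the same record shape, which is the property the abstract attributes to the unified dataset.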

Problem

Research questions and friction points this paper is trying to address.

Lack of specialized medical knowledge in general AI models

Need for comprehensive multimodal medical dataset

Improving vision-language integration for medical diagnosis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal medical dataset from specialized sources

Three-stage training for vision-language integration

State-of-the-art performance in medical tasks
