🤖 AI Summary
General-purpose AI systems face limitations in clinical deployment because they lack domain-specific medical knowledge and handle multimodal medical data poorly. To address this, we introduce GMAI-VL-5.5M, the first large-scale, specialized medical vision-language dataset, comprising 5.5 million high-quality image–text pairs, and propose a three-stage vision-language joint training paradigm that integrates ViT-based encoders with large language models: visual encoder initialization, then image–text alignment fine-tuning, then task-aware refinement. Our methodological contributions are (1) multi-source medical data cleaning and structured pairing augmentation; (2) a clinical-scenario-oriented progressive pretraining strategy; and (3) a unified cross-task, cross-modal modeling framework. Evaluations show state-of-the-art performance on medical visual question answering, lesion detection, and radiology report generation, outperforming CLIP, Med-PaLM, and PMC-VL, with better generalizability and clinically interpretable outputs.
📝 Abstract
Despite significant advances in general AI, its effectiveness in the medical domain is limited by a lack of specialized medical knowledge. To address this, we construct GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. The dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building on it, we develop GMAI-VL, a general medical vision-language model, using a three-stage training strategy that strengthens the integration of visual and textual information. This strategy improves the model's ability to process multimodal data, supporting accurate diagnosis and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.