🤖 AI Summary
This work addresses the limitation of existing medical vision-language models (VLMs), which rely on coarse-grained contrastive learning and struggle to capture the systematic visual knowledge embedded in medical phenotype ontologies. To overcome this, the authors propose PhenoLIP, a novel framework that explicitly integrates structured phenotype ontologies into medical VLM pretraining. The approach introduces PhenoKG—the first large-scale, phenotype-centric multimodal knowledge graph—and PhenoBench, a corresponding evaluation benchmark. PhenoLIP employs a two-stage pretraining strategy leveraging ontology-guided phenotype embeddings, teacher-guided distillation, and multimodal contrastive learning. Experimental results demonstrate that PhenoLIP outperforms BiomedCLIP by 8.85% on phenotype classification and surpasses BIOMEDICA by 15.03% in cross-modal retrieval, significantly enhancing the model’s structured understanding and interpretability.
📝 Abstract
Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
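
The abstract does not give the exact form of the training objective, so the following is only an illustrative sketch of how a second-stage loss of this kind could be assembled: a symmetric CLIP-style contrastive term plus a teacher-guided distillation term. Everything specific here is an assumption rather than the authors' formulation: the function names, the KL-divergence form of the distillation term, the temperature of 0.07, the weight `lam`, and the premise that image, text, and phenotype embeddings share one dimension.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss; row i of img is paired with row i of txt."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def distillation_loss(student_img, teacher_txt, phenotype_emb, temperature=0.07):
    """Assumed form: KL(teacher || student) over similarities to phenotype embeddings.

    The teacher distribution comes from a (hypothetical) frozen stage-one text
    encoder; the student distribution comes from the image encoder being trained.
    """
    phen = l2_normalize(phenotype_emb)
    teacher_logp = log_softmax(l2_normalize(teacher_txt) @ phen.T / temperature)
    student_logp = log_softmax(l2_normalize(student_img) @ phen.T / temperature)
    kl_per_sample = (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=1)
    return kl_per_sample.mean()

def total_loss(img, txt, teacher_txt, phenotype_emb, lam=1.0):
    """Contrastive term plus lam-weighted distillation term (lam is an assumed knob)."""
    return contrastive_loss(img, txt) + lam * distillation_loss(img, teacher_txt, phenotype_emb)
```

In this sketch the distillation term vanishes when the student's image embeddings induce the same distribution over phenotypes as the teacher's text embeddings, which is one plausible reading of "distilling structured knowledge into multimodal pretraining"; the paper itself may use a different divergence or alignment target.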