🤖 AI Summary
Existing medical vision-language pretraining methods struggle with noisy web-scale data and unstructured, lengthy clinical notes. To address these challenges, we propose an ontology-guided multi-agent collaborative pretraining framework. First, we construct a multi-agent system based on foundation models that autonomously generates high-quality, fine-grained skin image descriptions, whose semantic fidelity is validated via retrieval. Second, we design an ontology-guided attention mechanism coupled with multi-level contrastive learning to explicitly model semantic relationships among medical concepts, enabling holistic–local cross-modal alignment. Third, we incorporate knowledge distillation to enhance generalization. Evaluated on eight dermatological datasets, our method achieves state-of-the-art performance in zero-shot disease classification and cross-modal retrieval. Furthermore, we publicly release Derm1M-AgentAug—a large-scale, high-quality augmented dataset comprising over 400K image–text pairs—facilitating future research in medical vision-language understanding.
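The retrieval-based validation step mentioned above can be illustrated with a minimal sketch: a generated caption is kept only if its paired image ranks that caption highly among a pool of candidates by embedding similarity. The function names, the top-k criterion, and the use of plain cosine similarity are illustrative assumptions, not the paper's actual pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def passes_retrieval_check(image_emb, candidate_caption_embs, paired_idx, k=1):
    """Illustrative filter: keep a generated caption only if the paired image
    retrieves it within the top-k candidates by cosine similarity."""
    sims = [cosine(image_emb, c) for c in candidate_caption_embs]
    ranked = sorted(range(len(sims)), key=lambda i: -sims[i])
    return paired_idx in ranked[:k]
```

In this toy criterion, a caption whose embedding is far from its image (relative to distractor captions) is rejected, approximating a semantic-fidelity check without any manual annotation.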
📝 Abstract
Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image–text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400K skin image–text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.
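The multi-level alignment described above (global image–text contrast plus local aspect-to-patch matching) can be sketched in miniature. This is a generic symmetric InfoNCE-style contrastive loss and a best-match local score, written from the abstract's description under stated assumptions; the function names, temperature value, and max-over-patches aggregation are illustrative, not O-MAKE's actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Global-level alignment: symmetric InfoNCE loss over a batch where
    image_embs[i] is paired with text_embs[i]."""
    n = len(image_embs)
    loss = 0.0
    for i in range(n):
        # image -> text direction: the paired caption competes with all others
        sims = [cosine(image_embs[i], t) / temperature for t in text_embs]
        loss += -(sims[i] - math.log(sum(math.exp(s) for s in sims)))
        # text -> image direction
        sims_t = [cosine(text_embs[i], v) / temperature for v in image_embs]
        loss += -(sims_t[i] - math.log(sum(math.exp(s) for s in sims_t)))
    return loss / (2 * n)


def aspect_patch_score(patch_embs, aspect_embs):
    """Local-level alignment: each decomposed knowledge aspect is scored
    against its best-matching image patch (max over patches)."""
    return sum(max(cosine(a, p) for p in patch_embs)
               for a in aspect_embs) / len(aspect_embs)
```

A correctly paired batch yields a near-zero InfoNCE loss, while a mismatched batch is penalized; the local score rewards captions whose individual aspects each ground to some image region, which is the intuition behind patch-level alignment of long, multi-aspect clinical texts.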