🤖 AI Summary
The reliability of current medical foundation models in real-world clinical settings remains largely unverified, because their robustness has not been evaluated systematically. This work introduces the first unified robustness benchmark for medical vision-language and segmentation foundation models, covering critical clinical tasks such as visual question answering, radiology report generation, and image segmentation. The framework applies diverse realistic perturbations, including adversarial attacks, domain shifts, and image degradations, to simulate non-ideal clinical conditions. Experiments show that the performance of state-of-the-art medical foundation models degrades significantly under these perturbations, exposing their vulnerability in practical deployment. The benchmark offers a standard for assessing reliability and provides insights for the safe and effective clinical translation of medical AI systems.
📝 Abstract
Medical foundation models (MedFMs) have emerged as transformative tools in healthcare, demonstrating capabilities across diverse clinical applications. These models fall broadly into two paradigms: medical vision-language models (Med-VLMs) and segmentation foundation models. Med-VLMs range from medical-specialized models such as LLaVA-Med and MedGemma to general-purpose models such as GPT-4o and Gemini, all of which can perform medical image understanding tasks including visual question answering (VQA), report generation, and visual grounding. Concurrently, the Segment Anything Model (SAM) has catalyzed a new generation of medical segmentation models, including adaptations such as SAM-Med2D and MedSAM. The widespread clinical deployment of these models necessitates rigorous evaluation of their reliability under real-world conditions.