🤖 AI Summary
This work proposes a lightweight and deployable medical vision Transformer pretraining framework that addresses the limitations of existing methods, which struggle to model complex semantic relationships among clinical findings and rely on computationally intensive vision-language models. The approach leverages a frozen large language model as a structured semantic teacher, converting clinical findings into verifiable JSON field–state pairs via a Unified Medical Schema (UMS). Efficient knowledge distillation is achieved through answerability-aware masking and Structured Prediction Decomposition (SPD), while orthogonality-regularized query-grouped attention enhances representational capacity. Notably, the large language model is discarded after training. The method achieves a macro AUC of 0.8588 on CheXpert linear probing—surpassing BiomedCLIP by 6.65 points using only 1/500th of the training data—and demonstrates strong zero-shot transfer and cross-modal generalization on NIH ChestX-ray14, LIDC-IDRI, and OrganAMNIST benchmarks.
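The UMS supervision target can be pictured as a dictionary of field–state pairs per report, with answerability-aware masking dropping fields that the report does not let the model verify. A minimal sketch, assuming hypothetical field names and a `not_mentioned` sentinel for unanswerable states (the paper's actual schema and state vocabulary may differ):

```python
# Hypothetical UMS-style field–state pairs extracted from one report.
# Field names and state labels here are illustrative, not the paper's schema.
record = {
    "cardiomegaly": "present",
    "pleural_effusion": "absent",
    "pneumothorax": "not_mentioned",  # not answerable from this report
}

def answerable_fields(record, unanswerable="not_mentioned"):
    """Answerability-aware masking: keep only field–state pairs whose state
    is verifiable from the report, so the distillation loss ignores the rest."""
    return {field: state for field, state in record.items()
            if state != unanswerable}
```

With this masking, only `cardiomegaly` and `pleural_effusion` would contribute to the training objective for the example record above.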
📝 Abstract
Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field–state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight, deployable ViT-only backbone. We evaluate VIVID-Med across multiple settings: on CheXpert linear probing, it achieves a macro-AUC of 0.8588, outperforming BiomedCLIP by +6.65 points while using 500× less data. It also demonstrates robust zero-shot cross-domain transfer to NIH ChestX-ray14 (0.7225 macro-AUC) and strong cross-modality generalization to CT, achieving 0.8413 AUC on LIDC-IDRI lung nodule classification and 0.9969 macro-AUC on OrganAMNIST 11-organ classification. VIVID-Med offers a highly efficient, scalable alternative to deploying resource-heavy vision-language models in clinical settings.
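The SPD idea of partitioning cross-attention into orthogonality-regularized query groups can be sketched with plain NumPy: each group of learnable queries attends over the visual tokens independently, and an orthogonality penalty on the pooled group readouts pushes the groups toward complementary visual aspects. All function names, shapes, and the exact form of the regularizer below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_cross_attention(queries, feats, d):
    """Query-grouped cross-attention (illustrative).
    queries: (G, Q, d) learnable queries, one set per group
    feats:   (N, d)    visual tokens from the ViT
    returns: (G, Q, d) per-group readouts of the visual features
    """
    scores = queries @ feats.T / np.sqrt(d)  # (G, Q, N) scaled dot products
    attn = softmax(scores, axis=-1)          # attention over visual tokens
    return attn @ feats

def orthogonality_penalty(group_out):
    """Encourage groups to capture complementary aspects: mean-pool each
    group, L2-normalize, and penalize off-diagonal entries of the Gram
    matrix (zero iff the group vectors are mutually orthogonal)."""
    v = group_out.mean(axis=1)                              # (G, d)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    off = v @ v.T - np.eye(v.shape[0])
    return (off ** 2).sum() / 2
```

In training, a penalty like this would be added to the distillation loss so that each query group specializes; at inference only the ViT backbone is kept, matching the paper's deployment story.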