🤖 AI Summary
Medical vision-language pretraining has long been hindered by the scarcity of large-scale, densely annotated datasets that jointly support semantic understanding and pixel-level localization. To address this, we introduce MedicalNarratives, the first localized multimodal narrative dataset built from medical education videos. It comprises 4.7M image-text pairs, of which 1M carry dense annotations in the form of cursor trajectories and bounding boxes, enabling spatiotemporal alignment of speech, imaging, and instructor interaction. We adapt the Localized Narratives protocol to the medical domain and present GenMedClip, a unified model combining multimodal contrastive learning, trajectory-aware supervision, and cross-specialty optimization across 12 clinical domains. Evaluated on a newly constructed multimodal medical imaging benchmark, GenMedClip outperforms state-of-the-art methods. All code, data, models, and an interactive demonstration system are fully open-sourced.
📝 Abstract
We propose MedicalNarratives, a dataset curated from medical pedagogical videos, similar in nature to data collected in Think-Aloud studies and inspired by Localized Narratives: it provides grounded image-text data by curating instructors' speech and mouse cursor movements, synchronized in time. MedicalNarratives enables pretraining of both semantic and dense objectives, alleviating the need to train medical semantic and dense tasks separately for lack of reasonably sized datasets. Our dataset contains 4.7M image-text pairs from videos and articles, with 1M samples carrying dense annotations in the form of traces and bounding boxes. To evaluate the utility of MedicalNarratives, we train GenMedClip, based on the CLIP architecture, on our dataset spanning 12 medical domains and show that it outperforms previous state-of-the-art models on a newly constructed medical imaging benchmark that evaluates performance across all modalities. Data, demo, code, and models are available at https://medical-narratives.github.io
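Since GenMedClip is described as CLIP-based, its core training signal is a symmetric contrastive (InfoNCE) objective over matched image-text pairs. The sketch below is illustrative only, assuming a standard CLIP-style loss in NumPy; the function name, temperature value, and embedding shapes are our assumptions, not the paper's published implementation.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched
    image-text pair (hypothetical shapes; a sketch, not GenMedClip's code).
    """
    # L2-normalize so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature      # (N, N) similarity matrix
    labels = np.arange(len(logits))         # true matches sit on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image->text and text->image directions, as in CLIP.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Dense supervision from cursor traces would add further localization terms on top of this global objective, but the contrastive loss above is the shared semantic backbone.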