Quilt-1M: One Million Image-Text Pairs for Histopathology

📅 2023-06-20
🏛️ Neural Information Processing Systems
📈 Citations: 63 · Influential: 9
🤖 AI Summary
Multimodal learning in pathology is hindered by the scarcity of high-quality paired image–text data. To address this, the authors introduce Quilt-1M, the largest publicly available vision–language histopathology dataset to date, with one million image–text pairs. Its core is Quilt: 768,826 pairs mined from 1,087 hours of educational histopathology videos on YouTube via an automated pipeline that combines automatic speech recognition (ASR), large language models (LLMs), clinical knowledge bases, and handcrafted rule-based algorithms to produce accurate image–text alignments. Merging Quilt with data from Twitter, research papers, and the broader web yields Quilt-1M. A CLIP model fine-tuned on Quilt-1M outperforms state-of-the-art baselines on zero-shot and linear-probe classification across 13 patch-level datasets covering 8 sub-pathologies, and on cross-modal retrieval, demonstrating that video-derived data can support accurate, domain-specific vision–language alignment for pathology AI.
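The curation pipeline itself is not reproduced on this page. As a rough illustration of its rule-based filtering stage, the sketch below gates ASR transcript chunks on a histopathology vocabulary before they become caption candidates. The `HISTO_TERMS` set and the `Chunk` structure are illustrative stand-ins, not the authors' code; the real pipeline resolves terms against clinical knowledge bases (e.g., UMLS) and uses LLMs to correct ASR errors.

```python
from dataclasses import dataclass

# Toy stand-in for a clinical knowledge base; the paper's pipeline
# checks terms against curated human knowledge databases instead.
HISTO_TERMS = {
    "carcinoma", "adenoma", "stroma", "mitosis", "nuclei",
    "eosinophilic", "biopsy", "epithelium",
}

@dataclass
class Chunk:
    """One ASR transcript segment aligned to a video timestamp."""
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    text: str       # raw ASR transcript for this segment

def is_histopathology_chunk(chunk: Chunk, min_hits: int = 2) -> bool:
    """Keep a caption candidate only if it mentions enough domain terms."""
    tokens = chunk.text.lower().split()
    hits = sum(token.strip(".,;:") in HISTO_TERMS for token in tokens)
    return hits >= min_hits

def filter_captions(chunks: list[Chunk]) -> list[Chunk]:
    """Rule-based gate: drop narration unrelated to the slide on screen."""
    return [c for c in chunks if is_histopathology_chunk(c)]

if __name__ == "__main__":
    demo = [
        Chunk(12.0, 19.5, "Here the stroma shows eosinophilic change around the nuclei."),
        Chunk(20.0, 24.0, "Please like and subscribe for more videos."),
    ]
    print([c.text for c in filter_captions(demo)])  # keeps only the first chunk
```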
📝 Abstract
Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of analogous data in the medical field, specifically in histopathology, has halted comparable progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 768,826 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: Quilt-1M, with 1M paired image-text samples, marking it as the largest vision-language histopathology dataset to date. We demonstrate the value of Quilt-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new histopathology images across 13 diverse patch-level datasets of 8 different sub-pathologies and cross-modal retrieval tasks.
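To make the zero-shot evaluation concrete, the sketch below scores a histopathology patch against text prompts with the open_clip library. The model id (assumed here to be the authors' released QuiltNet checkpoint on the Hugging Face Hub), the image path, and the two-class prompt set are all assumptions of this sketch; the paper's 13 benchmarks define their own label sets.

```python
import torch
import open_clip
from PIL import Image

# Assumed hub id for the Quilt-1M-tuned CLIP (QuiltNet); any
# open_clip-compatible model id can be substituted.
MODEL_ID = "hf-hub:wisdomik/QuiltNet-B-32"

model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

# Illustrative class prompts for a binary patch-level task.
prompts = [
    "a histopathology image of benign tissue",
    "a histopathology image of malignant tumor",
]

image = preprocess(Image.open("patch.png")).unsqueeze(0)  # placeholder path
text = tokenizer(prompts)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print({p: round(float(s), 3) for p, s in zip(prompts, probs[0])})
```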
Problem

Research questions and friction points this paper is trying to address.

Multimodal Learning
Pathology
Data Scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Quilt-1M Dataset
Multimodal Data Integration
CLIP Fine-tuning (sketched below)
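The CLIP fine-tuning contribution follows the standard symmetric image–text contrastive recipe. Below is a minimal sketch of one loss computation under that recipe; the embeddings and temperature are stubs for illustration, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_feats: torch.Tensor,
              text_feats: torch.Tensor,
              logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale.exp() * image_feats @ text_feats.T  # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Stub batch: 8 pairs of 512-d embeddings, as a ViT-B/32 CLIP would produce.
img = torch.randn(8, 512, requires_grad=True)
txt = torch.randn(8, 512, requires_grad=True)
scale = torch.tensor(2.659)  # log(1/0.07), CLIP's initial temperature
loss = clip_loss(img, txt, scale)
loss.backward()
print(float(loss))
```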
👥 Authors
W. O. Ikezogwo (University of Washington)
M. S. Seyfioglu (University of Washington)
Fatemeh Ghezloo (Microsoft)
  Multimodal Learning · Data-Centric ML · Computer Vision · Biomedical AI
Dylan Stefan Chan Geva (University of Washington)
Fatwir Sheikh Mohammed (University of Washington)
Pavan Kumar Anand (University of Washington)
Ranjay Krishna (University of Washington, Allen Institute for AI)
  Computer Vision · Natural Language Processing · Machine Learning · Human Computer Interaction
Linda G. Shapiro (University of Washington)