🤖 AI Summary
Existing whole-slide image (WSI) diagnostic models rely on local patches and cannot model global spatial context. To address this limitation, the paper introduces QUILT-INSTRUCT, the first spatially grounded, context-coherent pathology instruction-tuning dataset for WSI analysis, comprising over 107,000 QA pairs. The dataset is built by automatically extracting narrators' cursor positions and the corresponding pathological narrations from educational YouTube videos. The model trained on it, QUILT-LLAVA, combines multi-granularity vision-language alignment, LLaVA-based pathology-domain visual instruction tuning, and WSI-level cross-patch reasoning. Evaluated on both a curated pathology VQA benchmark and public benchmarks, QUILT-LLAVA significantly outperforms state-of-the-art methods, with over a 10% improvement in relative GPT-4 score and gains of 4% and 9% on open-set and closed-set VQA, respectively.
📝 Abstract
Diagnosis in histopathology requires global analysis of whole slide images (WSIs), with pathologists compounding evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multimodal models. Training multimodal models for histopathology requires instruction-tuning datasets, which currently contain information for individual image patches, without spatial grounding of the concepts within each patch and without a wider view of the WSI. To bridge this gap, we introduce QUILT-INSTRUCT, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs, grounded within diagnostically relevant image patches that make up the WSI. Our dataset is collected by leveraging educational histopathology videos from YouTube, which provide spatial localization of narrations through automatic extraction of the narrators' cursor positions. QUILT-INSTRUCT supports contextual reasoning by extracting the diagnosis and supporting facts from the entire WSI. Using QUILT-INSTRUCT, we train QUILT-LLAVA, which can reason beyond the given single image patch, enabling diagnostic reasoning across patches. To evaluate QUILT-LLAVA, we propose a comprehensive evaluation dataset created from 985 images and 1283 human-generated question-answer pairs. We also thoroughly evaluate QUILT-LLAVA using public histopathology datasets, where QUILT-LLAVA significantly outperforms SOTA by over 10% on relative GPT-4 score and by 4% and 9% on open and closed set VQA, respectively. Our code, data, and model are publicly accessible at quilt-llava.github.io.