Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology

📅 2025-03-21

📈 Citations: 0

✨ Influential: 0

career value

159K/year

🤖 AI Summary

To address two key bottlenecks in few-shot classification of whole-slide images (WSIs)—namely, the heavy reliance of multi-instance learning (MIL) on abundant bag-level annotations and the lack of domain-specific pathological priors in vision-language models (VLMs)—this paper proposes the first VLM framework integrating pathology-semantic guidance with sliding-level prompt learning. Built upon the CLIP architecture, our method introduces learnable slide-level textual prompts and a pathology-knowledge-driven patch–tissue semantic alignment mechanism, enabling fine-grained vision–language alignment without bag-level labels. Crucially, we embed domain priors directly into the prompt learning process, overcoming limitations in local–global semantic modeling under few-shot settings. Evaluated on real-world WSI datasets, our approach achieves 8.2–14.7% absolute accuracy gains over state-of-the-art MIL and VLM baselines under 1–5 training examples per class. The code is publicly available.

Technology Category

Application Category

📝 Abstract

In this paper, we address the challenge of few-shot classification in histopathology whole slide images (WSIs) by utilizing foundational vision-language models (VLMs) and slide-level prompt learning. Given the gigapixel scale of WSIs, conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level (bag-level) predictions from patch representations, which require extensive bag-level labels for training. In contrast, VLM-based approaches excel at aligning visual embeddings of patches with candidate class text prompts but lack essential pathological prior knowledge. Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification, integrating this within a VLM-based MIL framework. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category. Experimentation on real-world pathological WSI datasets and ablation studies highlight our method's superior performance over existing MIL- and VLM-based methods in few-shot WSI classification tasks. Our code is publicly available at https://github.com/LTS5/SLIP.

Problem

Research questions and friction points this paper is trying to address.

Few-shot classification in histopathology whole slide images

Utilizing vision-language models for slide-level prompt learning

Integrating pathological prior knowledge for WSI classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Slide-level prompt learning with vision-language models

Utilizing pathological prior knowledge for patch identification

Few-shot WSI classification via VLM-based MIL framework

🔎 Similar Papers

MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning