🤖 AI Summary
Fine-grained glomerular subtype classification in renal pathology faces severe clinical deployment challenges due to extreme scarcity of labeled data (only 4–8 samples per class). Existing computational pathology methods rely heavily on fully supervised, coarse-grained image models and fail to generalize under such low-data regimes.
Method: We reformulate the task as a few-shot vision-language learning problem—the first such approach in renal pathology—and propose a novel paradigm that jointly optimizes discriminative learning between positive and negative samples and cross-modal semantic alignment between histopathological images and clinical text. We systematically investigate the impact of pathology-specific versus general-purpose vision-language model (VLM) architectures, domain-knowledge injection, and adaptation strategies on multimodal representation geometry and diagnostic performance.
Results: Our pathology-tailored VLM achieves significantly improved classification discrimination and prediction calibration using only minimal annotations; performance scales robustly with increasing sample size, demonstrating the feasibility of accurate, interpretable glomerular subtyping under few-shot conditions.
📝 Abstract
Fine-grained glomerular subtyping is central to kidney biopsy interpretation, but clinically valuable labels are scarce and difficult to obtain. Existing computational pathology approaches instead tend to evaluate coarse diseased classification under full supervision with image-only models, so it remains unclear how vision-language models (VLMs) should be adapted for clinically meaningful subtyping under data constraints. In this work, we model fine-grained glomerular subtyping as a clinically realistic few-shot problem and systematically evaluate both pathology-specialized and general-purpose vision-language models under this setting. We assess not only classification performance (accuracy, AUC, F1) but also the geometry of the learned representations, examining feature alignment between image and text embeddings and the separability of glomerular subtypes. By jointly analyzing shot count, model architecture and domain knowledge, and adaptation strategy, this study provides guidance for future model selection and training under real clinical data constraints. Our results indicate that pathology-specialized vision-language backbones, when paired with the vanilla fine-tuning, are the most effective starting point. Even with only 4-8 labeled examples per glomeruli subtype, these models begin to capture distinctions and show substantial gains in discrimination and calibration, though additional supervision continues to yield incremental improvements. We also find that the discrimination between positive and negative examples is as important as image-text alignment. Overall, our results show that supervision level and adaptation strategy jointly shape both diagnostic performance and multimodal structure, providing guidance for model selection, adaptation strategies, and annotation investment.