🤖 AI Summary
To address key challenges in whole-slide image (WSI) classification and automated pathology caption generation, including patch redundancy, loss of spatial context, and difficulty in semantic modeling, this paper proposes GNN-ViTCap, a framework that integrates Vision Transformers (ViTs) with Graph Neural Networks (GNNs). It introduces deep-embedding-based dynamic clustering and a scalar dot attention mechanism to select representative patches, and jointly fine-tunes a large language model (LLM) for end-to-end multimodal classification and caption generation. By unifying structural and semantic representation learning, GNN-ViTCap achieves state-of-the-art performance on the BreakHis and PatchGastric datasets: an F1 score of 0.934 and an AUC of 0.963 for classification, and a BLEU-4 score of 0.811 and a METEOR score of 0.569 for caption quality, substantially outperforming existing methods.
📝 Abstract
Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole slide image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSIs pose challenges such as redundant patches and unknown patch positions, since pathologists capture regions subjectively. Moreover, generating pathology captions automatically remains a significant challenge. To address these issues, we introduce GNN-ViTCap, a novel framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings with deep embedded clustering and selecting representative patches via a scalar dot attention mechanism. Next, we build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model's input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Experimental results demonstrate that GNN-ViTCap outperforms state-of-the-art approaches, offering a reliable and efficient solution for microscopy-based patient diagnosis.
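To make the graph-construction step in the abstract concrete, here is a minimal NumPy sketch of building a kNN graph over patch embeddings from a cosine-similarity matrix and running one parameter-free mean-aggregation message-passing step. The function names, the choice of `k`, and the unweighted mean aggregation are illustrative assumptions, not the paper's actual (learned) GNN layer.

```python
import numpy as np

def knn_graph(embeddings: np.ndarray, k: int = 3) -> np.ndarray:
    """Connect each patch to its k most similar patches (cosine similarity)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T                 # pairwise cosine-similarity matrix
    np.fill_diagonal(sim, -np.inf)          # a patch is not its own neighbor
    adj = np.zeros_like(sim)
    for i, row in enumerate(sim):
        neighbors = np.argsort(row)[-k:]    # indices of the k nearest neighbors
        adj[i, neighbors] = 1.0
    return np.maximum(adj, adj.T)           # symmetrize into an undirected graph

def gnn_aggregate(embeddings: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """One simplified message-passing step: average each node's neighborhood."""
    adj_sl = adj + np.eye(adj.shape[0])     # add self-loops so a node keeps its own signal
    deg = adj_sl.sum(axis=1, keepdims=True)
    return (adj_sl @ embeddings) / deg      # mean over neighbors, mixing local context

rng = np.random.default_rng(0)
patches = rng.normal(size=(8, 16))          # 8 toy patch embeddings of dimension 16
adj = knn_graph(patches, k=3)
pooled = gnn_aggregate(patches, adj).mean(axis=0)  # slide-level embedding
print(pooled.shape)                         # (16,)
```

In the full framework, the pooled embedding would then pass through the linear projection into the LLM's input space; here the mean pooling merely stands in for that aggregation.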