🤖 AI Summary
This work addresses the limited interpretability of existing multimodal approaches in computational pathology, which stems from the absence of high-quality datasets supporting explicit reasoning and from overly simplistic reasoning mechanisms. To overcome these challenges, the authors propose a novel framework integrating reinforcement learning with multimodal pathological analysis. They construct the first pathology visual question answering (VQA) dataset tailored for complex reasoning tasks and introduce a semantic reward mechanism alongside group relative policy optimization to enhance the accuracy and relevance of large language models in joint reasoning over histopathological images and text. Experiments demonstrate that the proposed method surpasses current state-of-the-art models using only 20% of the training data and achieves performance comparable to CLIP on zero-shot image classification, significantly improving both interpretability and reasoning capability.
📝 Abstract
Interpretability is significant in computational pathology, motivating the integration of multimodal information from histopathological images and corresponding text data. However, existing multimodal methods have limited interpretability due to the lack of high-quality datasets that support explicit reasoning and inference, and due to overly simple reasoning processes. To address these problems, we introduce a novel multimodal pathology large language model with strong reasoning capabilities. To improve the generation of accurate and contextually relevant textual descriptions, we design a semantic reward strategy integrated with group relative policy optimization. We also construct a high-quality pathology visual question answering (VQA) dataset specifically designed to support complex reasoning tasks. Comprehensive experiments on this dataset demonstrate that our method outperforms state-of-the-art methods even when trained with only 20% of the data, and achieves performance comparable to CLIP on the downstream zero-shot image classification task.
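The core idea of group relative policy optimization is that each sampled answer is scored and then normalized against the other answers drawn for the same prompt, so no separate value (critic) network is needed. The sketch below is a minimal, hypothetical illustration of that group-relative advantage computation; the `semantic_reward` function is a stand-in (simple token overlap), not the paper's actual learned semantic reward, and all names are illustrative assumptions.

```python
# Hypothetical sketch of the group-relative advantage used in GRPO-style training.
# For one prompt, G candidate answers are sampled and scored with a semantic
# reward; each reward is then normalized within its own sampling group.

from statistics import mean, stdev

def semantic_reward(candidate: str, reference: str) -> float:
    # Stand-in for a learned semantic similarity score:
    # token-level Jaccard overlap between candidate and reference text.
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # Normalize each reward against the mean/std of its sampling group,
    # replacing the critic network of standard PPO-style methods.
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Toy group of G = 3 sampled answers for one pathology VQA prompt.
reference = "tumor cells show nuclear atypia and frequent mitoses"
candidates = [
    "tumor cells show nuclear atypia and frequent mitoses",
    "cells show some atypia",
    "normal tissue without atypia",
]
rewards = [semantic_reward(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Answers that score above their group's mean receive positive advantages and are reinforced; below-average answers receive negative advantages, so the model learns from relative quality within each sampled group rather than from an absolute reward scale.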