AI Summary
This study addresses the problem of predicting pathologists' spatiotemporal visual attention while they review whole-slide images (WSIs) for cancer diagnosis. To model dynamic scanning trajectories, we propose a two-stage Transformer architecture: the first stage generates multi-scale attention heatmaps, and the second stage autoregressively predicts fixation sequences. We further introduce a semantics-preserving fixation extraction algorithm that jointly captures magnification level, spatial coordinates, and temporal dynamics. The model integrates digital microscope trajectory data with multi-scale histopathological features. Evaluated on viewing data from 43 pathologists across 123 WSIs, it outperforms chance and baseline models. This work presents the first end-to-end prediction framework for expert-level WSI scanning paths. It provides a quantifiable, interpretable attention assessment tool for pathology training and advances intelligent systems for diagnostic assistance and medical education.
Abstract
The ability to predict the attention of expert pathologists could lead to decision support systems for better pathology training. We developed methods to predict the spatio-temporal (where and when) movements of pathologists' attention as they grade whole slide images (WSIs) of prostate cancer. We characterize a pathologist's attention trajectory by the x, y, and m (magnification) movements of their viewport as they navigate WSIs using a digital microscope. This information was obtained from 43 pathologists across 123 WSIs, and we consider the task of predicting the attention scanpaths constructed from the viewport centers. We introduce a fixation extraction algorithm that simplifies an attention trajectory by extracting fixations from the pathologist's viewing behavior while preserving semantic information. We use these pre-processed data to train and test a two-stage model that predicts the dynamic (scanpath) allocation of attention during WSI reading via intermediate attention heatmap prediction. In the first stage, a transformer-based sub-network predicts the attention heatmaps (static attention) across different magnifications. In the second stage, a transformer-based model predicts the attention scanpath by autoregressively predicting the next fixation point, starting at the WSI center and leveraging multi-magnification feature representations from the first stage. Experimental results show that our scanpath prediction model outperforms chance and baseline models. Tools developed from this model could assist pathology trainees in learning to allocate their attention during WSI reading like an expert.
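To make the idea of reducing a viewport trajectory to fixations more concrete, the following is a minimal sketch and not the authors' algorithm, whose details the abstract does not give. It assumes each viewport sample is a tuple (t, x, y, m) and groups consecutive samples that share a magnification and stay within a distance of the running centroid. The names `Fixation` and `extract_fixations`, and the thresholds `radius` and `min_duration`, are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple

# One viewport sample: (t in seconds, x center, y center, m magnification).
# This layout is an assumption; the abstract only names x, y, and m.
Sample = Tuple[float, float, float, float]


@dataclass
class Fixation:
    x: float         # mean viewport-center x over the group
    y: float         # mean viewport-center y over the group
    m: float         # magnification shared by the group
    start: float     # time of the first sample in the group
    duration: float  # time between first and last sample


def extract_fixations(
    samples: List[Sample],
    radius: float = 50.0,       # hypothetical distance threshold
    min_duration: float = 0.5,  # hypothetical minimum dwell time
) -> List[Fixation]:
    """Group consecutive, nearby, same-magnification samples into fixations."""
    fixations: List[Fixation] = []
    group: List[Sample] = []

    def flush() -> None:
        # Keep the current group only if the viewport dwelt long enough.
        if group:
            duration = group[-1][0] - group[0][0]
            if duration >= min_duration:
                n = len(group)
                fixations.append(
                    Fixation(
                        x=sum(s[1] for s in group) / n,
                        y=sum(s[2] for s in group) / n,
                        m=group[0][3],
                        start=group[0][0],
                        duration=duration,
                    )
                )
            group.clear()

    for s in samples:
        if group:
            cx = sum(g[1] for g in group) / len(group)
            cy = sum(g[2] for g in group) / len(group)
            same_mag = s[3] == group[0][3]
            near = hypot(s[1] - cx, s[2] - cy) <= radius
            if not (same_mag and near):
                flush()  # a pan or zoom ends the current group
        group.append(s)
    flush()
    return fixations


# Toy trajectory: a dwell at 10x, two quick pans, then a dwell at 20x.
trajectory: List[Sample] = [
    (0.0, 100.0, 100.0, 10.0),
    (0.5, 105.0, 98.0, 10.0),
    (1.0, 102.0, 101.0, 10.0),
    (1.2, 500.0, 500.0, 10.0),
    (1.3, 900.0, 900.0, 10.0),
    (1.5, 900.0, 900.0, 20.0),
    (2.5, 905.0, 903.0, 20.0),
]
fixations = extract_fixations(trajectory)
```

In this sketch a change of magnification always starts a new group, so zooming in on the same location yields a separate fixation. That is one plausible way to keep the magnification dimension in the simplified scanpath, though the paper's own semantics-preserving criterion may differ.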