🤖 AI Summary
H&E staining exhibits substantial batch effects across multi-center pathology practices, undermining diagnostic consistency and AI model generalizability. To address this, we construct a multi-center H&E dataset that explicitly isolates stain variation across colon, kidney, and skin tissues. We systematically benchmark eight stain normalization methods, including classical approaches (Reinhard, Macenko, Vahadane, histogram matching) and deep generative models (CycleGAN, Pix2Pix), using both quantitative metrics (SSIM, PSNR) and blinded expert pathologist evaluation. Results indicate that generative methods achieve superior cross-laboratory robustness in stain normalization, and that increased data diversity after normalization enhances the generalization of downstream classification and segmentation models. This work establishes a reproducible benchmark and practical guidelines for histopathological stain standardization.
📝 Abstract
Hematoxylin and Eosin (H&E) staining has been the gold standard in tissue analysis for decades; however, tissue specimens stained in different laboratories vary, often significantly, in appearance. This variation poses a challenge both for pathologists and for AI-based downstream analysis. Minimizing stain variation computationally is an active area of research. To further investigate this problem, we collected a unique multi-center tissue image dataset, wherein tissue samples from colon, kidney, and skin tissue blocks were distributed to 66 different labs for routine H&E staining. To isolate staining variation, all other factors affecting tissue appearance were kept constant. We then used this dataset to compare the performance of eight stain normalization methods: four traditional methods (histogram matching, Macenko, Vahadane, and Reinhard normalization) and two deep learning-based methods (CycleGAN and Pix2Pix), each with two variants. We assessed these methods using both quantitative and qualitative evaluation. The dataset's inter-laboratory staining variation could also guide strategies to improve model generalizability through varied training data.
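To illustrate the statistics-matching idea underlying the traditional methods compared above, the sketch below shows a simplified Reinhard-style normalization: per-channel mean and standard deviation of a source image are mapped to those of a target (reference) image. This is a minimal illustration, not the paper's implementation; the original Reinhard method operates in the decorrelated lαβ color space, whereas plain RGB is used here for brevity, and the function name `reinhard_normalize` is ours.

```python
import numpy as np

def reinhard_normalize(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Simplified Reinhard-style stain normalization.

    Matches the per-channel mean and standard deviation of `source`
    to those of `target`. NOTE: the original method works in the
    lαβ color space; this sketch uses RGB channels directly.
    """
    src = source.astype(np.float64)
    tgt = target.astype(np.float64)
    out = np.empty_like(src)
    for c in range(src.shape[-1]):
        s_mu, s_sd = src[..., c].mean(), src[..., c].std()
        t_mu, t_sd = tgt[..., c].mean(), tgt[..., c].std()
        s_sd = s_sd if s_sd > 0 else 1.0  # guard against flat channels
        # Center, rescale to the target spread, shift to the target mean.
        out[..., c] = (src[..., c] - s_mu) / s_sd * t_sd + t_mu
    return np.clip(out, 0, 255).astype(np.uint8)
```

In practice such reference-based methods depend heavily on the choice of target image, which is one motivation for the learned (CycleGAN/Pix2Pix) alternatives benchmarked in the paper.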