🤖 AI Summary
Convolutional neural networks (CNNs) often achieve high accuracy in cancer histopathology image analysis, yet their performance may stem from non-clinical artifacts in the datasets rather than genuine medical features. To expose this systematic evaluation bias, this study constructs datasets of background-cropped image segments devoid of clinical information and runs controlled experiments with four mainstream CNN architectures across 13 cancer benchmark datasets, covering melanoma, breast, colorectal, and lung cancers. The results reveal that models can still attain up to 93% accuracy on these clinically meaningless background images, demonstrating their acute sensitivity to dataset bias. These findings challenge the reliability of current evaluation practices in computational pathology and call for more careful, bias-aware assessment of AI systems in cancer diagnosis.
📝 Abstract
Convolutional Neural Networks have shown promising effectiveness in identifying different types of cancer from pathology images. However, the opaque nature of CNNs makes it difficult to fully understand how they operate, limiting their assessment to empirical evaluation. Here we study the soundness of the standard practices by which CNNs are evaluated for the purpose of cancer pathology. Thirteen widely used cancer benchmark datasets, covering cancer types such as melanoma, carcinoma, colorectal cancer, and lung cancer, were analyzed with four common CNN architectures. We compared the accuracy of each model on the original images with its accuracy on datasets made of segments cropped from the backgrounds of those images, which contain no clinically relevant content. Because the cropped datasets contain no clinical information, the null hypothesis is that the CNNs should achieve only chance-level accuracy when classifying them. The results show that the CNN models reached high accuracy on the cropped segments, in some cases as high as 93%, even though the segments lacked biomedical information. The results also show that some CNN architectures are more sensitive to such bias than others. The analysis shows that common machine learning evaluation practices can lead to unreliable results when applied to cancer pathology. These biases are very difficult to identify, and might mislead researchers as they use available benchmark datasets to test the efficacy of CNN methods.
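The control experiment described above, cropping clinically empty background regions while keeping the original class labels, can be sketched as follows. This is a minimal illustration only: the paper does not specify which regions were cropped or at what size, so the corner-patch heuristic and the patch size here are assumptions.

```python
import numpy as np

def crop_background_patch(image, patch_size=32):
    """Crop a patch from the top-left corner of the image.
    Assumption: this corner is background with no clinical content
    (the study's actual cropping regions are not specified here)."""
    return image[:patch_size, :patch_size]

def make_background_dataset(images, labels, patch_size=32):
    """Build a label-preserving dataset of background-only patches.
    Under the null hypothesis, a CNN trained and tested on these
    patches should classify at chance level; accuracy well above
    chance indicates the benchmark leaks non-clinical bias."""
    patches = np.stack([crop_background_patch(im, patch_size) for im in images])
    return patches, np.asarray(labels)

# Toy usage: two 128x128 grayscale "slides" with their class labels.
imgs = [np.zeros((128, 128)), np.ones((128, 128))]
X, y = make_background_dataset(imgs, [0, 1], patch_size=32)
print(X.shape)  # (2, 32, 32)
```

The key design point is that only the pixels change: the labels are carried over unchanged, so any above-chance accuracy must come from information in the background rather than from the lesion or tissue itself.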