🤖 AI Summary
The ImageNet single-label evaluation paradigm overlooks inherent image polysemy, leading to misjudgments of models’ semantic understanding capabilities. Method: This work first systematically identifies the primary cause of performance degradation on ImageNetV2 as a discrepancy in the proportion of multi-label images, not a failure of model generalization, and proposes the first multi-label capability evaluation framework tailored to single-label pre-trained models, encompassing multi-label statistical analysis, cross-dataset label distribution modeling, and zero-shot multi-label inference evaluation. Contribution/Results: Experiments reveal that mainstream deep neural networks possess significant implicit multi-label recognition ability on ImageNet. Leveraging this insight, we show that the reported 11%–14% “accuracy drop” on ImageNetV2 largely reflects misclassifications of multi-label images, demonstrating that model robustness has been systematically underestimated. This work advances benchmark evaluation from single-label to multi-label paradigms.
📝 Abstract
ImageNet, an influential dataset in computer vision, is traditionally evaluated using single-label classification, which assumes that an image can be adequately described by a single concept or label. However, this approach may not fully capture the complex semantics within the images available in ImageNet, potentially hindering the development of models that effectively learn these intricacies. This study critically examines the prevalent single-label benchmarking approach and advocates for a shift to multi-label benchmarking for ImageNet. This shift would enable a more comprehensive assessment of the capabilities of deep neural network (DNN) models. We analyze the effectiveness of pre-trained state-of-the-art DNNs on ImageNet and one of its variants, ImageNetV2. Studies in the literature have reported unexpected accuracy drops of 11% to 14% on ImageNetV2. Our findings show that these reported declines are largely attributable to a characteristic of the dataset that has not received sufficient attention -- the proportion of images with multiple labels. Taking this characteristic into account, the results of our experiments provide evidence that there is no substantial degradation in effectiveness on ImageNetV2. Furthermore, we observe that ImageNet pre-trained models exhibit some capability to capture the multi-label nature of the dataset even though they were trained under the single-label assumption. Consequently, we propose a new evaluation approach that augments existing ones to assess this capability. Our findings highlight the importance of considering the multi-label nature of the ImageNet dataset during benchmarking. Failing to do so could lead to incorrect conclusions regarding the effectiveness of DNNs and divert research efforts from addressing other substantial challenges related to the reliability and robustness of these models.
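The paper's exact evaluation protocol is not reproduced here, but the core idea -- scoring a top-1 prediction as correct when it falls anywhere in an image's set of valid labels, rather than requiring a match with the one annotated label -- can be sketched as follows. The toy logits and label sets below are illustrative assumptions, not data from the paper.

```python
import numpy as np

def single_label_accuracy(logits, labels):
    """Strict top-1 accuracy: the prediction must equal the single annotated label."""
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == labels))

def multi_label_accuracy(logits, label_sets):
    """Multi-label accuracy: the top-1 prediction counts as correct if it
    appears anywhere in the image's set of valid labels."""
    preds = np.argmax(logits, axis=1)
    return float(np.mean([p in s for p, s in zip(preds, label_sets)]))

# Toy example: 4 images, 5 classes (hypothetical model outputs).
logits = np.array([
    [0.1, 0.8, 0.0, 0.0, 0.1],  # predicts class 1
    [0.7, 0.1, 0.1, 0.0, 0.1],  # predicts class 0
    [0.0, 0.1, 0.2, 0.6, 0.1],  # predicts class 3
    [0.2, 0.1, 0.1, 0.1, 0.5],  # predicts class 4
])
single_labels = np.array([1, 2, 3, 0])     # original single-label annotations
multi_labels = [{1}, {0, 2}, {3}, {0, 4}]  # plausible multi-label annotations

print(single_label_accuracy(logits, single_labels))  # 0.5
print(multi_label_accuracy(logits, multi_labels))    # 1.0
```

In this toy case, half of the "errors" under the single-label metric are predictions of another valid label for the same image, so the apparent accuracy gap vanishes once multiple labels are admitted -- the same mechanism the paper identifies behind the reported ImageNetV2 drop.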