🤖 AI Summary
This work addresses the mismatch between ImageNet's single-label annotation and the inherent multi-object nature of real-world images, which introduces label noise and limits model performance. The authors propose the first large-scale, fully automatic multi-label annotation framework that requires no manual labeling. Leveraging a self-supervised Vision Transformer for unsupervised object region discovery, the method integrates a lightweight classifier with region-level label propagation to generate a globally consistent, high-quality multi-label training set. This approach overcomes prior limitations that relied on human annotations or were restricted to validation-set refinement. It achieves notable improvements in top-1 accuracy (up to +2.0 on ImageNet-ReaL and +1.5 on ImageNet-V2) and substantially boosts mean average precision (mAP) on transfer tasks, with gains of +4.2 on COCO and +2.3 on VOC.
📝 Abstract
The original ImageNet benchmark enforces a single-label assumption, despite many images depicting multiple objects. This leads to label noise and limits the richness of the learning signal. Multi-label annotations more accurately reflect real-world visual scenes, where multiple objects co-occur and contribute to semantic understanding, enabling models to learn richer and more robust representations. While prior efforts (e.g., ReaL, ImageNet-V2) have improved the validation set, no scalable, high-quality multi-label annotation of the training set has yet been available. To this end, we present an automated pipeline to convert the ImageNet training set into a multi-label dataset, without human annotations. Using self-supervised Vision Transformers, we perform unsupervised object discovery, select regions aligned with original labels to train a lightweight classifier, and apply it to all regions to generate coherent multi-label annotations across the dataset. Our labels show strong alignment with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks. Compared to the traditional single-label scheme, models trained with our multi-label supervision achieve consistently better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and exhibit stronger transferability to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively). These results underscore the importance of accurate multi-label annotations for enhancing both classification performance and representation learning. Project code and the generated multi-label annotations are available at https://github.com/jchen175/MultiLabel-ImageNet.
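The final stage of the pipeline described above, propagating region-level classifier predictions into one coherent label set per image, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `aggregate_multi_labels`, the confidence threshold, and the any-region voting rule are all assumptions for the sake of the example.

```python
from typing import Dict, List

def aggregate_multi_labels(
    region_scores: List[Dict[str, float]],
    original_label: str,
    threshold: float = 0.8,  # hypothetical confidence cutoff
) -> List[str]:
    """Merge per-region classifier scores into one multi-label set.

    Each entry in `region_scores` maps class names to the lightweight
    classifier's confidence for one discovered object region. A class is
    kept if any region supports it above `threshold`; the image's
    original single label is always retained.
    """
    labels = {original_label}
    for scores in region_scores:
        for cls, conf in scores.items():
            if conf >= threshold:
                labels.add(cls)
    return sorted(labels)
```

For example, an image originally labeled `"dog"` whose discovered regions score `{"dog": 0.95, "ball": 0.6}` and `{"ball": 0.9}` would receive the multi-label set `["ball", "dog"]`: the second region pushes `"ball"` above the threshold, while the original label is kept regardless of score.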