🤖 AI Summary
Existing public waste classification datasets are commonly hindered by fragmentation, inconsistent annotations, and environmental bias, which severely limit model generalization. To address these limitations, this work integrates multiple sources to construct GlobalWasteData (GWD), a unified and large-scale dataset comprising 89,807 images across 14 main categories and 68 subcategories. Through multi-source semantic alignment, standardized annotation protocols, deduplication, quality filtering, and comprehensive metadata generation, GWD significantly enhances data consistency, category balance, and cross-scenario diversity. As the largest and most structurally coherent open-source waste classification dataset to date, GWD provides a robust benchmark for AI-driven applications in waste recognition, automated recycling, and environmental monitoring.
📝 Abstract
The growing volume of waste is an environmental problem that calls for efficient techniques for sorting different kinds of waste. Automated waste classification systems serve this purpose. The effectiveness of the Artificial Intelligence (AI) models behind these systems depends on the quality and accessibility of publicly available datasets, which provide the basis for training and evaluating classification algorithms. Although several public waste classification datasets exist, they remain fragmented, inconsistent, and biased toward specific environments. Differences in class names, annotation formats, image conditions, and class distributions make it difficult to combine these datasets or to train models that generalize well to real-world scenarios. To address these issues, we introduce the GlobalWasteData (GWD) archive, a large-scale dataset of 89,807 images across 14 main categories, annotated with 68 distinct subclasses. We compile the GWD archive by merging multiple publicly available datasets into a single, unified resource. The archive offers consistent labeling, improved domain diversity, and more balanced class representation, enabling the development of robust and generalizable waste recognition models. Additional preprocessing steps such as quality filtering, duplicate removal, and metadata generation further improve dataset reliability. Overall, this dataset offers a strong foundation for Machine Learning (ML) applications in environmental monitoring, recycling automation, and waste identification, and is publicly available to promote future research and reproducibility.
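The merging steps named in the abstract (label unification, duplicate removal, metadata generation) can be illustrated with a minimal sketch. This is not the authors' pipeline. The source names, the label mapping, the `merge_sources` and `class_distribution` helpers, and the use of SHA-256 over raw bytes to detect exact duplicates are all assumptions made for illustration. The abstract does not say how duplicates were detected, and byte-level hashing would miss resized or re-encoded copies.

```python
import hashlib
from collections import Counter

# Hypothetical mapping from (source dataset, source label) to a unified
# (main category, subclass) pair. These names are illustrative only and
# are not taken from the GWD taxonomy.
LABEL_MAP = {
    ("dataset_a", "plastic_bottle"): ("plastic", "bottle"),
    ("dataset_b", "PET bottle"): ("plastic", "bottle"),
    ("dataset_b", "tin can"): ("metal", "can"),
}


def merge_sources(records):
    """Merge (source, label, image_bytes) records into unified metadata rows.

    Records whose label has no mapping are skipped, and byte-identical
    images are kept only once (first occurrence wins).
    """
    seen_hashes = set()
    merged = []
    for source, label, image_bytes in records:
        unified = LABEL_MAP.get((source, label))
        if unified is None:
            continue  # label not covered by the unified taxonomy
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of an image already kept
        seen_hashes.add(digest)
        main_category, subclass = unified
        merged.append(
            {
                "sha256": digest,
                "source": source,
                "main_category": main_category,
                "subclass": subclass,
            }
        )
    return merged


def class_distribution(merged):
    """Count images per main category to inspect class balance."""
    return Counter(row["main_category"] for row in merged)


# Toy input: short byte strings stand in for image file contents.
records = [
    ("dataset_a", "plastic_bottle", b"img-1"),
    ("dataset_b", "PET bottle", b"img-1"),  # same bytes, different source
    ("dataset_b", "tin can", b"img-2"),
    ("dataset_b", "unknown", b"img-3"),     # unmapped label
]
merged = merge_sources(records)
distribution = class_distribution(merged)
```

In this toy input, the second record is dropped as a duplicate and the fourth as unmapped, leaving one `plastic` and one `metal` entry. A real pipeline would likely replace the exact hash with a perceptual hash or embedding similarity to catch near-duplicates.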