GlobalWasteData: A Large-Scale, Integrated Dataset for Robust Waste Classification and Environmental Monitoring

📅 2026-02-07
🤖 AI Summary
Existing public waste classification datasets are commonly hindered by fragmentation, inconsistent annotations, and environmental bias, which severely limit model generalization. To address these limitations, this work integrates multiple sources to construct GlobalWasteData (GWD), a unified and large-scale dataset comprising 89,807 images across 14 main categories and 68 subcategories. Through multi-source semantic alignment, standardized annotation protocols, deduplication, quality filtering, and comprehensive metadata generation, GWD significantly enhances data consistency, category balance, and cross-scenario diversity. As the largest and most structurally coherent open-source waste classification dataset to date, GWD provides a robust benchmark for AI-driven applications in waste recognition, automated recycling, and environmental monitoring.
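The summary lists deduplication among the curation steps but does not say how it was done. As a minimal sketch, assuming only exact byte-level duplicates need to be caught, files can be compared by content hash. The function names `file_digest` and `drop_exact_duplicates` are hypothetical and not taken from the paper; near-duplicates (resized or re-encoded copies) would need a different technique such as perceptual hashing.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def drop_exact_duplicates(paths):
    """Keep the first path seen for each distinct file content.

    Assumption: two images count as duplicates only if their bytes are
    identical. This is an illustrative choice, not the paper's method.
    """
    seen = set()
    unique = []
    for path in paths:
        digest = file_digest(Path(path))
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique
```

Because the first occurrence is kept, the order in which source datasets are listed decides which copy of a duplicated image survives.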

📝 Abstract
The growing volume of waste is an environmental problem that demands efficient techniques for sorting different kinds of waste. Automated waste classification systems built on Artificial Intelligence (AI) models are used for this purpose. The effectiveness of these models depends on the quality and accessibility of publicly available datasets, which provide the basis for training and evaluating classification algorithms. Although several public waste classification datasets exist, they remain fragmented, inconsistent, and biased toward specific environments. Differences in class names, annotation formats, image conditions, and class distributions make it difficult to combine these datasets or to train models that generalize well to real-world scenarios. To address these issues, we introduce the GlobalWasteData (GWD) archive, a large-scale dataset of 89,807 images across 14 main categories, annotated with 68 distinct subclasses. We compile the GWD archive by merging multiple publicly available datasets into a single, unified resource. The archive offers consistent labeling, improved domain diversity, and more balanced class representation, enabling the development of robust and generalizable waste recognition models. Additional preprocessing steps, such as quality filtering, duplicate removal, and metadata generation, further improve dataset reliability. Overall, the dataset offers a strong foundation for Machine Learning (ML) applications in environmental monitoring, recycling automation, and waste identification, and it is publicly available to promote future research and reproducibility.
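The abstract notes that differing class names make source datasets hard to combine. One way to picture the harmonization step is a lookup table from each source's label to a unified (main category, subcategory) pair. The sketch below is illustrative only: the dataset names, labels, and the `LABEL_MAP`, `harmonize`, and `class_counts` names are hypothetical and do not come from the GWD archive.

```python
from collections import Counter

# Hypothetical mapping from (source dataset, original label) to a unified
# (main category, subcategory) pair. All names here are illustrative.
LABEL_MAP = {
    ("dataset_a", "plastic bottle"): ("plastic", "bottle"),
    ("dataset_b", "PET_bottle"): ("plastic", "bottle"),
    ("dataset_b", "newspaper"): ("paper", "newspaper"),
}


def harmonize(records, label_map=LABEL_MAP):
    """Map (source, label, path) records onto the unified taxonomy.

    Records whose label has no mapping are returned separately so they
    can be reviewed by hand instead of being silently dropped.
    """
    mapped, unmapped = [], []
    for source, label, path in records:
        key = (source, label)
        if key in label_map:
            main, sub = label_map[key]
            mapped.append(
                {"path": path, "main": main, "sub": sub, "source": source}
            )
        else:
            unmapped.append((source, label, path))
    return mapped, unmapped


def class_counts(mapped):
    """Count images per main category to inspect class balance."""
    return Counter(item["main"] for item in mapped)
```

Keeping the source name in each mapped record preserves provenance, which is the kind of information a generated metadata file could carry.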
Problem

Research questions and friction points this paper is trying to address.

waste classification
dataset fragmentation
annotation inconsistency
class imbalance
domain bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

waste classification
integrated dataset
data harmonization
domain diversity
balanced class distribution
Authors
Misbah Ijaz
Department of Computer Science, University of Gujrat, Gujrat, 51700, Pakistan
Saif Ur Rehman Khan
Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
Abd Ur Rehman
Department of Computer Science, University of Gujrat, Gujrat, 51700, Pakistan
Tayyaba Asif
Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
Sebastian Vollmer
German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
Andreas Dengel
Professor of Computer Science, University of Kaiserslautern & Executive Director, DFKI
Artificial Intelligence, Machine Learning, Document Analysis, Semantic Technologies
Muhammad Nabeel Asim
German Research Center for Artificial Intelligence
Artificial Intelligence