GeoDE: a Geographically Diverse Evaluation Dataset for Object Recognition

📅 2023-01-05
🏛️ Neural Information Processing Systems
📈 Citations: 42
✨ Influential: 4
🤖 AI Summary
Existing object recognition datasets predominantly rely on web-crawled images, introducing geographic bias (over-representing North America and Europe), privacy risks from personally identifiable information (PII), and stereotypical biases. Method: We propose GeoDE, a geographically diverse, crowdsourced image benchmark covering six global regions and 40 categories (61,940 images), with strict PII removal and balanced regional sampling. GeoDE introduces a novel geographic-stratified crowdsourcing paradigm for data collection and annotation, explicitly avoiding inherent biases of web-sourced data. Contribution/Results: Experiments reveal significant performance degradation of mainstream models (e.g., ResNet-50) on non-Western regions. Incremental fine-tuning with only 1,000 to 2,000 GeoDE images per region boosts cross-regional average accuracy by up to 12.3%, demonstrating that small-scale, geographically diverse data yields substantial gains in model generalization across regions. GeoDE is the first publicly available benchmark designed to systematically address geographic representativeness in visual recognition.
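The summary mentions balanced regional sampling and adding a fixed number of images per region to a training set. A minimal sketch of what region-stratified sampling could look like is given here; it is not the authors' code. The record layout (`region` and `path` keys), the function name `sample_per_region`, and the placeholder region names are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_region(records, per_region, seed=0):
    """Draw up to `per_region` records from each region, without replacement.

    `records` is a list of dicts with a "region" key (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the records by their region label.
    by_region = defaultdict(list)
    for rec in records:
        by_region[rec["region"]].append(rec)

    # Take the same number from every region, capped by what is available.
    sample = []
    for region in sorted(by_region):
        pool = by_region[region]
        k = min(per_region, len(pool))
        sample.extend(rng.sample(pool, k))
    return sample


# Toy metadata with placeholder region names (not the dataset's real labels).
toy_records = [
    {"region": region, "path": f"{region}_{i}.jpg"}
    for region in ("region_a", "region_b")
    for i in range(5)
]
subset = sample_per_region(toy_records, per_region=3)
```

Capping each draw at the pool size keeps the function usable when one region has fewer images than requested, at the cost of a slightly unbalanced sample in that case.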
📝 Abstract
Current dataset collection methods typically scrape large amounts of data from the web. While this technique is extremely scalable, data collected in this way tends to reinforce stereotypical biases, can contain personally identifiable information, and typically originates from Europe and North America. In this work, we rethink the dataset collection paradigm and introduce GeoDE, a geographically diverse dataset with 61,940 images from 40 classes and 6 world regions, and no personally identifiable information, collected through crowd-sourcing. We analyse GeoDE to understand differences in images collected in this manner compared to web-scraping. Despite the smaller size of this dataset, we demonstrate its use as both an evaluation and training dataset, highlight shortcomings in current models, and show improved performance when even small amounts of GeoDE (1,000 to 2,000 images per region) are added to a training dataset. We release the full dataset and code at https://geodiverse-data-collection.cs.princeton.edu/
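Using a geographically labelled dataset for evaluation, as the abstract describes, amounts to breaking accuracy down by region rather than reporting a single number. The sketch below shows one way to do that, assuming predictions are available as `(region, true_label, predicted_label)` tuples. The function names, the tuple layout, and the toy labels are hypothetical and not taken from the released code.

```python
from collections import defaultdict


def accuracy_by_region(predictions):
    """Return {region: accuracy} from (region, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, true_label, predicted_label in predictions:
        total[region] += 1
        correct[region] += int(true_label == predicted_label)
    return {region: correct[region] / total[region] for region in total}


def accuracy_gap(per_region):
    """Difference between the best and worst per-region accuracy."""
    return max(per_region.values()) - min(per_region.values())


# Toy predictions with placeholder regions and labels.
toy_predictions = [
    ("region_a", "chair", "chair"),
    ("region_a", "stove", "stove"),
    ("region_b", "chair", "chair"),
    ("region_b", "stove", "bag"),
]
per_region = accuracy_by_region(toy_predictions)
gap = accuracy_gap(per_region)
```

Reporting the gap alongside the mean makes a regional shortfall visible even when overall accuracy looks high.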
Problem

Research questions and friction points this paper is trying to address.

Addresses geographic bias in object recognition datasets
Introduces a diverse dataset from six world regions
Mitigates stereotypical biases and privacy concerns in data collection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geographically diverse dataset collection method
Solicits images directly from crowd-sourced contributors in each region rather than scraping the web
No personally identifiable information included