🤖 AI Summary
Mainstream vision-language datasets exhibit significant cultural and socioeconomic biases, overrepresenting high-income Western contexts and consequently impairing model generalization in low-income and non-Western communities. To address this, we propose a function-centric, cross-cultural object modeling paradigm and introduce the Culture Affordance Atlas, the first vision-language dataset spanning 46 everyday functional categories and 288 object types across diverse economic backgrounds. Built on the Dollar Street dataset, it features human-verified, function-aligned annotations and integrates CLIP-based cross-cultural semantic analysis. We further introduce two novel evaluation metrics: functional consistency and inter-group performance disparity. Empirical results demonstrate that function-centric labeling reduces the median performance gap between high- and low-income groups by 6 percentage points, substantially improving recognition accuracy in resource-constrained settings and enhancing algorithmic fairness.
📝 Abstract
Culture shapes the objects people use and the purposes they serve, yet mainstream Vision-Language (VL) datasets frequently exhibit cultural biases, disproportionately favoring higher-income, Western contexts. This imbalance reduces model generalizability and perpetuates performance disparities, especially for lower-income and non-Western communities. To address these disparities, we propose a novel function-centric framework that categorizes objects by the functions they fulfill across diverse cultural and economic contexts. We implement this framework by creating the Culture Affordance Atlas, a re-annotated, culturally grounded restructuring of the Dollar Street dataset spanning 46 functions and 288 objects, publicly available at https://lit.eecs.umich.edu/CultureAffordance-Atlas/index.html. Through extensive empirical analyses using the CLIP model, we demonstrate that function-centric labels substantially reduce the socioeconomic performance gap between high- and low-income groups by a median of 6 pp (a statistically significant reduction), improving model effectiveness in lower-income contexts. Furthermore, our analyses reveal numerous culturally essential objects that are frequently overlooked in prominent VL datasets. Our contributions offer a scalable pathway toward building inclusive VL datasets and equitable AI systems.
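As a rough illustration of the inter-group performance disparity metric, the sketch below runs CLIP zero-shot classification with function-centric prompts and compares per-function accuracy across income groups. It uses the Hugging Face `transformers` CLIP API; the model variant, `load_examples`, the function subset, and the prompt template are all hypothetical placeholders for illustration, not the paper's released code.

```python
from statistics import median

import torch
from transformers import CLIPModel, CLIPProcessor

# Assumption: the exact CLIP variant used in the paper may differ.
MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

# Illustrative subset of the 46 functional categories.
FUNCTIONS = ["storing water", "preparing food", "lighting the home"]
# Hypothetical prompt template for function-centric labels.
PROMPTS = [f"an object used for {f}" for f in FUNCTIONS]

def load_examples():
    """Hypothetical loader yielding (PIL.Image, function_label, income_group)
    triples from a Dollar-Street-style dataset; substitute real data here."""
    yield from []

@torch.no_grad()
def classify(image):
    # Score the image against every function prompt and take the argmax.
    inputs = processor(text=PROMPTS, images=image,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape: (1, len(PROMPTS))
    return FUNCTIONS[logits.argmax(dim=-1).item()]

# Tally per-(group, function) correct predictions and counts.
hits, totals = {}, {}
for image, label, group in load_examples():
    key = (group, label)
    totals[key] = totals.get(key, 0) + 1
    hits[key] = hits.get(key, 0) + int(classify(image) == label)

def accuracy(group, label):
    return hits.get((group, label), 0) / max(totals.get((group, label), 0), 1)

# Inter-group performance disparity: the per-function accuracy gap between
# income groups, summarized by its median (the quantity the 6 pp claim tracks).
gaps = [accuracy("high_income", f) - accuracy("low_income", f) for f in FUNCTIONS]
print(f"median inter-group disparity: {median(gaps):.3f}")
```

Repeating this computation once with object-type labels and once with function-centric labels, then comparing the two median gaps, would reproduce the shape of the paper's headline comparison under these assumptions.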