🤖 AI Summary
This study investigates how “hyper-datafication”—the intensifying reliance on massive, often unexamined data streams in frontier artificial intelligence—systematically externalizes environmental burdens, labor risks, and representational injustices onto the Global South and marginalized communities. Integrating quantitative analysis of approximately 550,000 datasets from the Hugging Face Hub (assessing storage energy consumption, carbon footprint, and linguistic representation), qualitative interviews with data workers in Kenya, and geospatial data on global data centre distribution, this work introduces the concept of hyper-datafication to demonstrate that data production has become an active driver—not merely a prerequisite—of AI development. The study further proposes Data PROOFS, a six-dimensional sustainability framework encompassing provenance, resource awareness, ownership, openness, frugality, and standards, offering both empirical grounding and actionable guidance for AI ethics and policy.
📝 Abstract
Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation in language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings, illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. We therefore propose the Data PROOFS recommendations, spanning provenance, resource awareness, ownership, openness, frugality, and standards, to mitigate these costs. Our work aims to make visible the often-overlooked costs of the data that underpins frontier AI and to stimulate broader debate within the research community and beyond.