🤖 AI Summary
This paper addresses pervasive cultural bias, the quality deficiencies of web-scraped data, and the systemic invisibility of labor in machine learning datasets. We propose a participatory data construction paradigm that combines a collaborative crowdsourcing framework, culturally situated collection protocols, and qualitative reflexive analysis grounded in practice logs and interviews. Using this paradigm, we co-create World Wide Dishes (WWD), the first high-quality, multimodal food culture dataset built collectively by globally diverse communities. We systematically identify and theorize four categories of critical invisible labor: community trust-building, participatory accessibility design, data production support, and interpretation of the relationship between data and culture. Our contributions include: (1) a reusable, open framework for participatory dataset construction; (2) empirical validation of decentralized, anti-colonial data practices; and (3) interdisciplinary innovation bridging CSCW and ML data governance.
📝 Abstract
We provide a window into the process of constructing a dataset for machine learning (ML) applications by reflecting on the process of building World Wide Dishes (WWD), an image and text dataset consisting of culinary dishes and their associated customs from around the world. WWD takes a participatory approach to dataset creation: community members guide the design of the research process and engage in crowdsourcing efforts to build the dataset. WWD responds to calls in ML to address the limitations of web-scraped Internet datasets with curated, high-quality data incorporating localised expertise and knowledge. Our approach supports decentralised contributions from communities that have not historically contributed to datasets as a result of a variety of systemic factors. We contribute empirical evidence of the invisible labour of participatory design work by analysing reflections from the research team behind WWD. In doing so, we extend computer-supported cooperative work (CSCW) literature that examines the post-hoc impacts of datasets when deployed in ML applications by providing a window into the dataset construction process. We surface four dimensions of invisible labour in participatory dataset construction: building trust with community members, making participation accessible, supporting data production, and understanding the relationship between data and culture. This paper builds upon the rich participatory design literature within CSCW to guide how future efforts to apply participatory design to dataset construction can be designed in a way that attends to the dynamic, collaborative, and fundamentally human processes of dataset creation.