Bridging the Data Provenance Gap Across Text, Speech and Video

📅 2024-12-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses systemic deficiencies in multimodal AI training data concerning source transparency, licensing restrictions, and geographic and linguistic representativeness. Method: We conduct the first longitudinal provenance audit spanning text, speech, and video modalities (1990–2024), covering nearly 4,000 publicly available datasets. Our approach integrates fine-grained human annotation, cross-modal metadata standardization, provenance chain tracing, and quantitative analysis of geographic and linguistic coverage. Results: We find that although fewer than 33% of datasets are themselves restrictively licensed, over 80% of the source content in widely used datasets carries non-commercial restrictions; relative multilingual and geographic representation has not significantly improved since 2013; and web-crawled, synthetic, and social media sources have dominated dataset sourcing since 2019. We publicly release the complete audit dataset, providing an empirical foundation and methodological framework to advance data traceability, de-Westernization, and responsible AI development.

📝 Abstract
Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities (popular text, speech, and video datasets), from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets from 1990 to 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. First, we find that multimodal machine learning applications have overwhelmingly turned to web-crawled data, synthetic data, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Second, tracing the chain of dataset derivations, we find that while fewer than 33% of datasets are restrictively licensed, over 80% of the source content in widely used text, speech, and video datasets carries non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates that measures of relative geographical and multilingual representation have failed to improve significantly since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, and that visibility into these questions is essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.
Problem

Research questions and friction points this paper is trying to address.

Analyzes data sourcing trends across text, speech, video.
Examines licensing restrictions in multimodal AI datasets.
Assesses geographical and linguistic representation in datasets.
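
The distinction between the absolute number of languages covered and their relative representation can be illustrated with a small sketch. It uses the Gini coefficient as a concentration measure; treating this as the measure is an assumption made for illustration, since this page does not state which statistic the audit uses. The function name `gini` and the per-language counts are hypothetical.

```python
from typing import Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative counts.

    0 means every language has the same number of datasets;
    values approaching 1 mean a few languages hold almost all of them.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive count")
    # Rank-weighted sum over the ascending-sorted counts.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical dataset counts per language for two snapshots
# (illustrative numbers only, not taken from the audit).
earlier = [90, 5, 5]
later = [900, 50, 50, 1, 1, 1]

print("languages:", len(earlier), "gini:", round(gini(earlier), 3))
print("languages:", len(later), "gini:", round(gini(later), 3))
```

In this toy example the later snapshot covers twice as many languages, yet its counts are more concentrated, which is the kind of pattern a relative measure can reveal and a raw language count cannot.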
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conducts the largest longitudinal audit of text, speech, and video datasets.
Traces data provenance through chains of dataset derivations.
Analyzes restrictions inherited from dataset sources.
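
To show why tracing derivation chains matters for licensing, here is a minimal sketch, not the authors' tooling. It assumes each dataset record lists its own license class and the sources it was derived from; the record layout, the dataset names, and the functions `inherited_licenses` and `is_effectively_restricted` are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy records: name -> (own license class, names of direct sources).
Records = Dict[str, Tuple[str, List[str]]]

DATASETS: Records = {
    "web_video_site": ("non-commercial", []),
    "public_domain_books": ("unrestricted", []),
    "video_corpus": ("unrestricted", ["web_video_site"]),
    "caption_set": ("unrestricted", ["video_corpus", "public_domain_books"]),
    "book_qa": ("unrestricted", ["public_domain_books"]),
}


def inherited_licenses(name: str, records: Records,
                       seen: Optional[Set[str]] = None) -> Set[str]:
    """Collect license classes of a dataset and everything it derives from."""
    seen = set() if seen is None else seen
    if name in seen:  # guard against cyclic derivation records
        return set()
    seen.add(name)
    own_license, sources = records[name]
    found = {own_license}
    for source in sources:
        found |= inherited_licenses(source, records, seen)
    return found


def is_effectively_restricted(name: str, records: Records) -> bool:
    """True if the dataset or any upstream source is non-commercial."""
    return "non-commercial" in inherited_licenses(name, records)


declared = sum(lic == "non-commercial" for lic, _ in DATASETS.values())
effective = sum(is_effectively_restricted(n, DATASETS) for n in DATASETS)
print(f"declared restricted: {declared}/{len(DATASETS)}")
print(f"restricted via sources: {effective}/{len(DATASETS)}")
```

In this toy graph only one record declares a non-commercial license, but three inherit one through their sources, mirroring the gap between dataset licenses and source restrictions described in the abstract.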
Authors

Shayne Longpre (MIT, Stanford, Apple): Deep Learning, Natural Language Understanding
Nikhil Singh (The Data Provenance Initiative)
Manuel Cherep (PhD Student, Massachusetts Institute of Technology): Machine Learning, Generative Models, Audio, Agents
Kushagra Tiwary (Massachusetts Institute of Technology): Computer Vision, Machine Learning
Joanna Materzynska (The Data Provenance Initiative)
William Brannon (The Data Provenance Initiative)
Robert Mahari (Associate Director, Stanford CodeX Center): Computational Law
Manan Dey (The Data Provenance Initiative)
Mohammed Hamdy (The Data Provenance Initiative)
Nayan Saxena (The Data Provenance Initiative)
Ahmad Mustafa Anis (The Data Provenance Initiative)
Emad A. Alghamdi (The Data Provenance Initiative)
Vu Minh Chien (The Data Provenance Initiative)
Naana Obeng-Marnu (The Data Provenance Initiative)
Da Yin (Meta FAIR): Natural Language Processing
Kun Qian (The Data Provenance Initiative)
Yizhi Li (University of Manchester, M-A-P): LLM, Reasoning, Post-training, Computational Music
Minnie Liang (The Data Provenance Initiative)
An Dinh (The Data Provenance Initiative)
Shrestha Mohanty (Massachusetts Institute of Technology): Natural Language Processing, Machine learning, Human Centered AI
Deividas Mataciunas (The Data Provenance Initiative)
Tobin South (Massachusetts Institute of Technology)
Jianguo Zhang (The Data Provenance Initiative)
Ariel N. Lee (The Data Provenance Initiative)
Campbell S. Lund (The Data Provenance Initiative)
Christopher Klamm (The Data Provenance Initiative)
Damien Sileo (Inria): Natural Language Processing, Reasoning, Datasets, LLMs, Synthetic data
Diganta Misra (The Data Provenance Initiative)
Enrico Shippole (The Data Provenance Initiative)
Kevin Klyman (Stanford, Harvard): Foundation Models, AI Regulation, Geopolitics
JV Lester Miranda (The Data Provenance Initiative)