How do data owners say no? A case study of data consent mechanisms in web-scraped vision-language AI training datasets

📅 2025-11-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the critical challenge of respecting data owners' intent and ensuring copyright compliance when training vision-language AI models. Methodologically, it integrates sample-level signals (e.g., copyright notices, watermarks, metadata) with domain-level policies (e.g., robots.txt, Terms of Service), combining statistical estimation and content detection to systematically characterize multi-channel opt-out signals, marking the first such comprehensive analysis. Key findings: (i) at least 122 million samples in CommonPool contain explicit copyright indicators; (ii) 60% of samples from the top 50 web domains originate from websites whose Terms of Service explicitly prohibit scraping; and (iii) 9–13% of images contain watermarks, yet current detection methods exhibit high false-negative rates. These results expose substantial gaps in AI data pipelines' ability to recognize and respond to refusal signals. Accordingly, the study proposes a unified data consent framework tailored to AI training, offering an empirically grounded foundation and an actionable path toward accountable, auditable, and legally compliant data governance.
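One of the domain-level consent channels audited above, the Robots Exclusion Protocol, can be checked with Python's standard library alone. The robots.txt content, agent names, and URL below are illustrative; a real audit, as in the paper, would fetch each domain's live robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: the site refuses an AI crawler but allows others.
# A real audit would fetch https://<domain>/robots.txt per dataset domain.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The AI crawler is refused; a generic crawler is not.
print(parser.can_fetch("GPTBot", "https://example.com/images/photo.jpg"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/images/photo.jpg"))  # True
```

Note that this only captures one channel: a permissive robots.txt says nothing about the site's Terms of Service, which the paper treats as a separate opt-out signal.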

📝 Abstract
The internet has become the main source of data to train modern text-to-image or vision-language models, yet it is increasingly unclear whether web-scale data collection practices for training AI systems adequately respect data owners' wishes. Ignoring the owner's indication of consent around data usage not only raises ethical concerns but has also recently been elevated into lawsuits around copyright infringement. In this work, we aim to reveal information about data owners' consent to AI scraping and training, and study how it is expressed in DataComp, a popular dataset of 12.8 billion text-image pairs. We examine both sample-level information, including copyright notices, watermarking, and metadata, and web-domain-level information, such as a site's Terms of Service (ToS) and Robots Exclusion Protocol. We estimate that at least 122M samples in CommonPool exhibit some indication of a copyright notice, and find that 60% of the samples in the top 50 domains come from websites with ToS that prohibit scraping. Furthermore, we estimate that 9–13% of CommonPool samples (95% confidence interval) contain watermarks, which existing watermark detection methods fail to capture with high fidelity. Our holistic methods and findings show that data owners rely on various channels to convey data consent, which current AI data collection pipelines do not entirely respect. These findings highlight the limitations of current dataset curation and release practices and the need for a unified data consent framework that takes AI purposes into consideration.
Problem

Research questions and friction points this paper is trying to address.

Investigating how data owners express consent for AI data scraping and training
Analyzing copyright notices, watermarks and ToS compliance in web datasets
Revealing that current AI data collection fails to respect owner consent mechanisms
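The sample-level copyright-notice signal amounts to scanning captions and metadata for explicit indicators such as the © symbol. The patterns below are a hypothetical approximation for illustration, not the paper's actual rule set, and the captions are made up.

```python
import re

# Hypothetical indicator patterns; the paper's exact matching rules are not
# reproduced here.
COPYRIGHT_PATTERNS = re.compile(
    r"(©|\(c\)\s|\bcopyright\b|\ball rights reserved\b)",
    re.IGNORECASE,
)

def has_copyright_indicator(caption: str) -> bool:
    """Return True if the caption carries an explicit copyright signal."""
    return bool(COPYRIGHT_PATTERNS.search(caption))

captions = [
    "sunset over the bay © Jane Doe Photography",
    "a cat sitting on a windowsill",
    "Stock photo - All Rights Reserved",
]
print([has_copyright_indicator(c) for c in captions])  # [True, False, True]
```

Caption text alone undercounts, of course: watermarks and image-embedded notices need content detection, which is why the paper combines both channels.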
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed copyright notices and metadata in datasets
Examined website Terms of Service and Robots protocols
Estimated watermark prevalence and evaluated existing watermark detection methods
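The "9–13% with 95% confidence" watermark estimate is the shape of a standard binomial confidence interval over a detector's positives on a random sample. A minimal sketch using the Wilson score interval is below; the sample counts are made up and the paper's exact estimator may differ.

```python
import math

def wilson_interval(positives: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = positives / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Made-up sample: 110 watermarked images found in a random sample of 1000.
lo, hi = wilson_interval(110, 1000)
print(f"watermark prevalence: {lo:.1%} - {hi:.1%}")  # 9.2% - 13.1%
```

Because the detector itself has a high false-negative rate (per the summary above), such an interval is a lower bound on true prevalence unless detector recall is corrected for.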
Chung Peng Lee
Princeton University
Rachel Hong
University of Washington
Harry Jiang
Carnegie Mellon University
Aster Plotnik
University of Toronto
William Agnew
Postdoc, Carnegie Mellon University
Artificial Intelligence, Algorithms, Security
Jamie Morgenstern
University of Washington
Algorithmic game theory, machine learning, privacy, approximation algorithms