🤖 AI Summary
This study addresses the widespread issue of "permissive washing" in open-source AI ecosystems: models, datasets, and applications labeled with permissive licenses such as MIT or Apache-2.0 often lack the required license text, copyright notices, or upstream attributions, introducing legal compliance risks. The work quantifies this problem across the AI supply chain for the first time, combining automated crawling, metadata analysis, and manual verification to audit 3,338 datasets, 6,664 models, and 28,516 applications on Hugging Face and GitHub. The findings reveal that 96.5% of datasets and 95.8% of models are non-compliant, and that only 5.75% of downstream applications preserve complete license statements. The study argues that license validity should be determined from legal documents rather than metadata alone, and introduces a reproducible, large-scale compliance-auditing framework.
📝 Abstract
Permissive licenses like MIT, Apache-2.0, and BSD-3-Clause dominate open-source AI, signaling that artifacts such as models, datasets, and code can be freely used, modified, and redistributed. However, these licenses carry mandatory requirements (include the full license text, provide a copyright notice, and preserve upstream attribution) that remain unverified at scale. Failure to meet these conditions can place reuse outside the scope of the license, effectively leaving AI artifacts under default copyright for those uses and exposing downstream users to litigation. We call this phenomenon "permissive washing": labeling AI artifacts as free to use while omitting the legal documentation required to make that label actionable. To assess how widespread permissive washing is in the AI supply chain, we empirically audit 124,278 dataset → model → application supply chains, spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub. We find that an astonishing 96.5% of datasets and 95.8% of models lack the required license text, and that only 2.3% of datasets and 3.2% of models satisfy both the license-text and copyright requirements. Even when upstream artifacts provide complete licensing evidence, attribution rarely propagates downstream: only 27.59% of models preserve compliant dataset notices and only 5.75% of applications preserve compliant model notices (with just 6.38% preserving any linked upstream notice). Practitioners cannot assume permissive labels confer the rights they claim: license files and notices, not metadata, are the source of legal truth. To support future research, we release our full audit dataset and reproducible pipeline.
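To make the compliance criterion concrete, here is a minimal sketch of the kind of per-artifact check the abstract describes: an artifact with a permissive license label counts as compliant only if it also ships the full license text and a copyright notice. This is an illustrative reconstruction, not the paper's pipeline; the `Artifact` type, field names, and file-name heuristics are assumptions.

```python
from dataclasses import dataclass, field

# Metadata labels the audit treats as permissive (assumed SPDX-style tags).
PERMISSIVE_TAGS = {"mit", "apache-2.0", "bsd-3-clause"}


@dataclass
class Artifact:
    """Hypothetical model of one audited repo (dataset, model, or app)."""
    license_tag: str                  # metadata label, e.g. "mit"
    files: list = field(default_factory=list)   # repo file paths
    readme_text: str = ""             # free text that may carry a notice


def has_license_text(artifact: Artifact) -> bool:
    """Does the repo ship a full license file (LICENSE*, COPYING)?"""
    for path in artifact.files:
        name = path.split("/")[-1].upper()
        if name.startswith("LICENSE") or name == "COPYING":
            return True
    return False


def has_copyright_notice(artifact: Artifact) -> bool:
    """Crude notice check: a copyright line somewhere in the repo text."""
    text = artifact.readme_text.lower()
    return "copyright" in text or "©" in artifact.readme_text


def is_compliant(artifact: Artifact) -> bool:
    """Permissive label is only actionable if both obligations are met."""
    return (
        artifact.license_tag.lower() in PERMISSIVE_TAGS
        and has_license_text(artifact)
        and has_copyright_notice(artifact)
    )
```

Under this framing, the paper's headline numbers correspond to how often `has_license_text` fails (96.5% of datasets, 95.8% of models) versus how often both checks pass (2.3% and 3.2%); the key point is that the check inspects files and notices, never the metadata tag alone.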