🤖 AI Summary
Widespread license conflicts in the open-source AI ecosystem pose significant legal and ethical risks, yet systematic empirical analysis remains lacking.
Method: This paper presents the first end-to-end license compliance audit spanning from Hugging Face datasets and models to downstream GitHub applications, covering 364K datasets, 1.6M models, and 140K open-source projects. We design a scalable rule engine grounded in SPDX standards and AI-specific licensing clauses, incorporating nearly 200 formalized license rules for automated conflict detection.
Contribution/Results: We identify license drift in 35.5% of integrated models and license conflicts in 86.4% of software applications. We release the first large-scale open-source AI license audit dataset and an open-source prototype detection tool. Our work establishes a theoretical foundation, empirical evidence, and technical infrastructure for automated license compliance governance in AI ecosystems.
📝 Abstract
Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks, exposing organizations to potential litigation and users to undisclosed risk. However, the field lacks a data-driven understanding of how frequently these conflicts occur, where they originate, and which communities are most affected. We present the first end-to-end audit of licenses for datasets and models on Hugging Face, as well as their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects. Our empirical analysis reveals systemic non-compliance: 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms. In addition, we prototype an extensible rule engine that encodes almost 200 SPDX and model-specific clauses, and it detects license conflicts in 86.4% of the audited software applications. To support future research, we release our dataset and the prototype engine. Our study highlights license compliance as a critical governance challenge in open-source AI and provides both the data and tools necessary to enable automated, AI-aware compliance at scale.
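To make the rule-engine idea concrete, the conflict checks described above can be sketched as a lookup over SPDX-style license identifiers. This is a minimal illustrative sketch, not the paper's actual ~200-rule engine: the category table, the `Llama-2-Community` label, and the forbidden-pair rules are simplified assumptions for exposition.

```python
# Sketch of a rule-based license conflict check (illustrative only).
# The category table and rules below are simplified assumptions,
# not the paper's actual rule set.

# Map SPDX-style license IDs to coarse restrictiveness categories.
CATEGORY = {
    "MIT": "permissive",
    "Apache-2.0": "permissive",
    "GPL-3.0-only": "copyleft",
    "CC-BY-NC-4.0": "non-commercial",
    "Llama-2-Community": "model-specific",  # hypothetical label for a model license
}

# Rule: a downstream license must not drop obligations carried upstream.
# Relicensing copyleft, non-commercial, or model-specific terms under a
# permissive license is the "license drift" pattern the paper measures.
FORBIDDEN = {
    ("copyleft", "permissive"),
    ("non-commercial", "permissive"),
    ("model-specific", "permissive"),
}

def has_conflict(upstream: str, downstream: str) -> bool:
    """Return True if relicensing `upstream` under `downstream` drops restrictions."""
    pair = (CATEGORY.get(upstream, "unknown"), CATEGORY.get(downstream, "unknown"))
    return pair in FORBIDDEN

# Example: a non-commercial model integrated into an MIT-licensed application.
print(has_conflict("CC-BY-NC-4.0", "MIT"))  # → True (conflict)
print(has_conflict("Apache-2.0", "MIT"))    # → False (compatible)
```

A production engine would additionally parse full SPDX license expressions (e.g. `MIT OR Apache-2.0`) and encode clause-level obligations rather than coarse categories, but the table-driven shape stays the same.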