From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Widespread license conflicts in the open-source AI ecosystem pose significant legal and ethical risks, yet systematic empirical analysis remains lacking. Method: This paper presents the first end-to-end license compliance audit spanning Hugging Face datasets/models to GitHub downstream applications—covering 364K datasets, 1.6M models, and 140K open-source projects. We design a scalable rule engine grounded in SPDX standards and AI-specific licensing clauses, incorporating nearly 200 formalized license rules for automated conflict detection. Contribution/Results: We identify license drift in 35.5% of integrated models and license conflicts in 86.4% of software applications. We release the first large-scale open-source AI license audit dataset and an open-source prototype detection tool. Our work establishes a theoretical foundation, empirical evidence, and technical infrastructure for automated license compliance governance in AI ecosystems.

📝 Abstract
Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks, exposing organizations to potential litigation and users to undisclosed risk. However, the field lacks a data-driven understanding of how frequently these conflicts occur, where they originate, and which communities are most affected. We present the first end-to-end audit of licenses for datasets and models on Hugging Face, as well as their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects. Our empirical analysis reveals systemic non-compliance: 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms. In addition, we prototype an extensible rule engine that encodes nearly 200 SPDX and model-specific clauses, detecting license conflicts in 86.4% of software applications. To support future research, we release our dataset and the prototype engine. Our study highlights license compliance as a critical governance challenge in open-source AI and provides both the data and tools necessary to enable automated, AI-aware compliance at scale.
Problem

Research questions and friction points this paper is trying to address.

Detecting hidden license conflicts in the open-source AI ecosystem
Measuring the frequency and origin of license-compliance violations
Identifying the communities most affected by AI license drift
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end audit of AI licenses
Extensible rule engine for conflicts
Dataset and prototype engine release
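The rule-engine idea can be illustrated as a clause-preservation check: each upstream license imposes a set of clauses, and relicensing "drifts" when the downstream license cannot preserve them. A minimal sketch follows; the license names are real SPDX-style identifiers, but the tiny rule table and the `detect_drift` helper are hypothetical simplifications of the paper's nearly 200 formalized rules, not its actual implementation.

```python
# Hypothetical sketch of clause-preservation checking for license drift.
# The clause sets below are illustrative, not the paper's real rule base.

# Clauses that an upstream license imposes on downstream redistribution.
LICENSE_CLAUSES = {
    "MIT":          set(),
    "Apache-2.0":   {"notice"},
    "GPL-3.0-only": {"notice", "copyleft"},
    "CC-BY-NC-4.0": {"attribution", "non-commercial"},
    "OpenRAIL-M":   {"use-restrictions"},  # AI-specific behavioral-use clauses
}

# Clauses that a downstream license is able to carry forward.
PRESERVED_BY = {
    "MIT":          set(),
    "Apache-2.0":   {"notice"},
    "GPL-3.0-only": {"notice", "copyleft"},
}

def detect_drift(upstream: str, downstream: str) -> set[str]:
    """Return the upstream clauses dropped when relicensing downstream.

    A non-empty result flags license drift: restrictive terms that
    the downstream license silently eliminates.
    """
    required = LICENSE_CLAUSES.get(upstream, set())
    kept = PRESERVED_BY.get(downstream, set())
    return required - kept

if __name__ == "__main__":
    # A non-commercial dataset relicensed under MIT drops both clauses.
    print(detect_drift("CC-BY-NC-4.0", "MIT"))
    # MIT code under Apache-2.0 preserves everything required (nothing).
    print(detect_drift("MIT", "Apache-2.0"))
```

In this toy form, `detect_drift("CC-BY-NC-4.0", "MIT")` returns `{"attribution", "non-commercial"}`, the kind of permissive-relicensing drift the paper reports in 35.5% of model-to-application transitions.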