The ML Supply Chain in the Era of Software 2.0: Lessons Learned from Hugging Face

📅 2025-02-06

🤖 AI Summary

In the Software 2.0 era, missing documentation, unclear dependencies, and inconsistent licenses in ML model supply chains undermine regulatory compliance and security. This paper presents a large-scale empirical study of 760,460 models and 175,000 datasets on Hugging Face, constructing dependency graphs and proposing a documentation quality assessment framework alongside a license consistency analysis method. The results show that over 60% of models lack essential documentation and 32% exhibit license declaration conflicts, exposing a highly fragmented and opaque supply chain structure. To support reproducibility and future research, the authors release their research infrastructure and dataset. The work provides an empirical foundation and methodological support for traceability, compliance governance, and security auditing of ML supply chains.

📝 Abstract

The last decade has seen widespread adoption of Machine Learning (ML) components in software systems. This has occurred in nearly every domain, from natural language processing to computer vision. These ML components range from relatively simple neural networks to complex and resource-intensive large language models. However, despite this widespread adoption, little is known about the supply chain relationships that produce these models, which can have implications for compliance and security. In this work, we conduct an extensive analysis of 760,460 models and 175,000 datasets mined from the popular model-sharing site Hugging Face. First, we evaluate the current state of documentation in the Hugging Face supply chain, report real-world examples of shortcomings, and offer actionable suggestions for improvement. Next, we analyze the underlying structure of the extant supply chain. Finally, we explore the current licensing landscape against what was reported in prior work and discuss the unique challenges posed in this domain. Our results motivate multiple research avenues, including the need for better license management for ML models/datasets, better support for model documentation, and automated inconsistency checking and validation. We make our research infrastructure and dataset available to facilitate future research.

Problem

Research questions and friction points this paper is trying to address.

Analyzes ML model supply chain relationships

Evaluates the state of documentation and licensing on Hugging Face

Proposes improvements for ML model management

Innovation

Methods, ideas, or system contributions that make the work stand out.

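Analyzes 760,460 models and 175,000 datasets mined from Hugging Face

Offers actionable suggestions for improving model documentation

Motivates automated inconsistency checking and validation for licenses and documentation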
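
To make the idea of a supply chain graph with license checks concrete, here is a minimal Python sketch. It is not the paper's method or tooling: the `Artifact` record, the `upstream` field, and the sample names and licenses are all hypothetical, and the check is deliberately naive. It only flags undeclared, unresolved, or differing license strings along a dependency edge, and a differing license is not necessarily a legal conflict.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Artifact:
    """Hypothetical, simplified record for a model or dataset."""
    name: str
    license: Optional[str] = None          # declared license, if any
    upstream: List[str] = field(default_factory=list)  # base models / training datasets


def build_edges(artifacts: List[Artifact]) -> List[Tuple[str, str]]:
    """Return (downstream, upstream) pairs that form the dependency graph."""
    return [(a.name, up) for a in artifacts for up in a.upstream]


def flag_license_issues(artifacts: List[Artifact]) -> List[Tuple[str, str, str]]:
    """Flag edges whose endpoints have missing or differing license declarations."""
    by_name = {a.name: a for a in artifacts}
    issues = []
    for down, up in build_edges(artifacts):
        child, parent = by_name[down], by_name.get(up)
        if parent is None:
            issues.append((down, up, "upstream not found"))
        elif child.license is None or parent.license is None:
            issues.append((down, up, "license undeclared"))
        elif child.license != parent.license:
            # Naive string comparison; real compatibility rules are more nuanced.
            issues.append((down, up, "license differs"))
    return issues


# Illustrative data only; these are not artifacts from the study.
sample = [
    Artifact("base-model", "apache-2.0"),
    Artifact("some-dataset", "cc-by-nc-4.0"),
    Artifact("fine-tuned-model", "mit", ["base-model", "some-dataset"]),
    Artifact("orphan-model", None, ["deleted-model"]),
]

for issue in flag_license_issues(sample):
    print(issue)
```

A real analysis would need a license compatibility table rather than string equality, and would have to handle artifacts whose upstream references are missing from their metadata entirely.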
