HuggingGraph: Understanding the Supply Chain of LLM Ecosystem

📅 2025-07-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of modeling dependencies between models and datasets in the large language model (LLM) supply chain. It introduces the first systematic, heterogeneous graph representation of the LLM supply chain on Hugging Face—comprising nearly 400,000 nodes and 450,000 edges—to capture model lineage, dataset reuse, and derivative dependencies. Methodologically, it integrates web crawling, heterogeneous graph construction, and dynamic graph analysis to uncover a core-periphery topology, power-law degree distribution, and strong temporal evolution. Key contributions include: (1) empirical validation that datasets serve as primary carriers for cross-generational transmission of model bias and risk; (2) identification of three structural properties—high sparsity, centrality concentration, and deep interconnectivity—in the supply chain; and (3) establishment of an interpretable, traceable graph foundation to support model provenance tracking, bias attribution, security compliance assessment, and governance.
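The heterogeneous graph construction described above can be sketched as follows. This is a minimal illustration using `networkx`; the record fields, relation names, and example IDs are assumptions for demonstration, not the authors' actual schema:

```python
import networkx as nx

# Directed heterogeneous graph: nodes are models or datasets; edges
# encode lineage ("finetuned_from") or dataset usage ("trained_on").
# Schema and records below are hypothetical, for illustration only.
G = nx.DiGraph()

models = [
    {"id": "model/base-7b", "base": None, "datasets": ["dataset/webtext"]},
    {"id": "model/chat-7b", "base": "model/base-7b", "datasets": ["dataset/instructions"]},
]

for m in models:
    G.add_node(m["id"], kind="model")
    if m["base"]:  # derivative-model edge capturing model lineage
        G.add_edge(m["base"], m["id"], rel="finetuned_from")
    for d in m["datasets"]:  # dataset-usage edge capturing training data
        G.add_node(d, kind="dataset")
        G.add_edge(d, m["id"], rel="trained_on")

print(G.number_of_nodes(), G.number_of_edges())  # → 4 3
```

Typed nodes and typed edges are what make the graph heterogeneous: a single traversal can then follow both model lineage and dataset reuse, which is the basis for the provenance and bias-attribution analyses the summary describes.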

📝 Abstract
Large language models (LLMs) leverage deep learning to process and predict sequences of words from context, enabling them to perform various NLP tasks, such as translation, summarization, question answering, and content generation. However, the growing size and complexity of developing, training, and deploying advanced LLMs require extensive computational resources and large datasets. This creates a barrier for users. As a result, platforms that host models and datasets are widely used. For example, Hugging Face, one of the most popular platforms, hosted 1.8 million models and 450K datasets by June 2025, with no sign of slowing down. Since many LLMs are built from base models, pre-trained models, and external datasets, they can inherit vulnerabilities, biases, or malicious components from earlier models or datasets. Therefore, it is critical to understand the origin and development of these components to better detect potential risks, improve model fairness, and ensure compliance. Motivated by this, our project aims to study the relationships between models and datasets, which are core components of the LLM supply chain. First, we design a method to systematically collect LLM supply chain data. Using this data, we build a directed heterogeneous graph to model the relationships between models and datasets, resulting in a structure with 397,376 nodes and 453,469 edges. We then perform various analyses and uncover several findings, such as: (i) the LLM supply chain graph is large, sparse, and follows a power-law degree distribution; (ii) it features a densely connected core and a fragmented periphery; (iii) datasets play pivotal roles in training; (iv) strong interdependence exists between models and datasets; and (v) the graph is dynamic, with daily updates reflecting the ecosystem's ongoing evolution.
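The "densely connected core and fragmented periphery" finding can be probed by comparing the largest weakly connected component against the remaining fragments. A sketch on a toy graph (the paper presumably runs this on the full 397,376-node graph):

```python
import networkx as nx

# Toy stand-in: one densely connected core plus peripheral fragments.
G = nx.DiGraph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])  # core
G.add_edge("x", "y")  # small peripheral fragment
G.add_node("z")       # isolated node

# Weak connectivity ignores edge direction, which suits reachability
# questions like "are these artifacts related at all?"
components = sorted(nx.weakly_connected_components(G), key=len, reverse=True)
core = components[0]
print(len(core), len(components))  # → 3 3
```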
Problem

Research questions and friction points this paper is trying to address.

Analyzing supply chain risks in LLM ecosystems
Mapping relationships between models and datasets
Detecting vulnerabilities and biases in LLM components
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically collects LLM supply chain data
Builds directed heterogeneous graph for relationships
Analyzes dynamic graph structure and dependencies
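The power-law degree distribution reported above could be checked roughly as follows. This sketch substitutes a small synthetic scale-free graph for the real supply-chain data, since Barabási–Albert graphs approximate a power law by construction:

```python
import collections
import networkx as nx

# Synthetic stand-in for the real graph: BA graphs are scale-free.
G = nx.barabasi_albert_graph(n=1000, m=2, seed=42)

degree_counts = collections.Counter(d for _, d in G.degree())
# Under a power law, log(count) falls roughly linearly in log(degree):
# many low-degree nodes, a long tail of high-degree hubs.
for degree in sorted(degree_counts)[:5]:
    print(degree, degree_counts[degree])
```

A rigorous fit would estimate the exponent with maximum likelihood rather than eyeballing a log-log plot, but this shape check is the usual first step.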
Mohammad Shahedur Rahman
University of Texas at Arlington
Peng Gao
Virginia Tech
Yuede Ji
The University of Texas at Arlington
High-Performance Computing · Cybersecurity · Graph AI · Graph Analytics