🤖 AI Summary
Pre-trained models (PTMs) are increasingly adopted as novel software dependencies—termed “Software Dependencies 2.0”—yet their reuse and integration practices in open-source projects, along with associated maintainability and reliability implications, remain poorly understood.
Method: We conduct the first systematic empirical study of PTM reuse using a mixed-methods approach: a statistically significant random sample of 401 GitHub repositories from the PeaTMOSS dataset, analyzed through quantitative pattern mining and in-depth qualitative case analysis.
Contribution/Results: We find widespread deficiencies in PTM versioning, documentation, and dependency tracking. We identify three canonical PTM reuse pipelines and uncover complex, cross-stage, asymmetric inter-model dependency structures. Critically, we empirically define and characterize the first structured usage paradigm for Software Dependencies 2.0, establishing foundational theory and evidence to guide model-aware software engineering practices and tooling.
📝 Abstract
Pre-trained models (PTMs) are machine learning models that have been trained in advance, often on large-scale data, and can be reused for new tasks, thereby reducing the need for costly training from scratch. Their widespread adoption introduces a new class of software dependency, which we term Software Dependencies 2.0, extending beyond conventional libraries to learned behaviors embodied in trained models and their associated artifacts. How PTMs are integrated as software dependencies in real projects remains unclear, a gap that potentially threatens the maintainability and reliability of modern software systems that increasingly rely on them. Objective: In this study, we investigate Software Dependencies 2.0 in open-source software (OSS) projects by examining the reuse of PTMs, with a focus on how developers manage and integrate these models. Specifically, we seek to understand: (1) how OSS projects structure and document their PTM dependencies; (2) what stages and organizational patterns emerge in the reuse pipelines of PTMs within these projects; and (3) how PTMs and other learned components interact across pipeline stages. We conduct a mixed-methods analysis of a statistically significant random sample of 401 GitHub repositories drawn from the PeaTMOSS dataset (28,575 repositories reusing PTMs from Hugging Face and PyTorch Hub). We quantitatively examine PTM reuse by identifying recurring patterns and qualitatively investigate how developers integrate and manage these models in practice.