Is Open Source the Future of AI? A Data-Driven Approach

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the practical efficacy of open-source large language models (LLMs) in addressing privacy, transparency, and misuse governance. We propose the first multidimensional evaluation framework—spanning weight openness, community engagement, architectural evolution, and performance dynamics—integrated with version trajectory analysis, performance attribution modeling, and quantitative analysis of pull requests, commits, and issues across 2020–2024. Results show that incremental open-sourcing significantly enhances governance feasibility; active community contributions improve inference efficiency by 19% on average, with decoder-only architectures yielding the largest gains; and collaborative open development reduces parameter count by 12–37% while maintaining accuracy loss within 1.8%. These findings bridge the gap between AI governance policy discourse and empirical evidence, providing data-driven foundations for optimizing open-source paradigms and strengthening responsible AI governance.
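The summary above describes a quantitative analysis of pull requests, commits, and issues across 2020–2024 used to relate community engagement to model performance. A minimal sketch of that kind of aggregation is shown below; the record fields, weights, and scoring function are illustrative assumptions, not the paper's actual schema or metric.

```python
from dataclasses import dataclass

# Hypothetical repository activity record; field names and weights are
# illustrative assumptions, not the paper's actual schema.
@dataclass
class RepoActivity:
    model: str
    year: int
    pull_requests: int
    commits: int
    issues_closed: int

def engagement_score(r: RepoActivity) -> float:
    # Weighted sum as a stand-in for a community-engagement metric.
    return 0.5 * r.pull_requests + 0.3 * r.commits + 0.2 * r.issues_closed

records = [
    RepoActivity("model-a", 2023, 120, 800, 95),
    RepoActivity("model-a", 2024, 210, 1400, 160),
]

# Year-over-year engagement trend for one model.
trend = [engagement_score(r) for r in sorted(records, key=lambda r: r.year)]
growth = (trend[-1] - trend[0]) / trend[0]
print(f"engagement growth: {growth:.1%}")
```

With real data, scores like this could then be correlated against inference-efficiency or accuracy changes between model versions, which is the shape of the attribution analysis the summary describes.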

📝 Abstract
Large Language Models (LLMs) have become central in academia and industry, raising concerns about privacy, transparency, and misuse. A key issue is the trustworthiness of proprietary models, with open-sourcing often proposed as a solution. However, open-sourcing presents challenges of its own, including potential misuse, financial disincentives, and intellectual property concerns, while proprietary models, backed by private-sector resources, are better positioned for return on investment. Other approaches lie on the spectrum between fully open-source and proprietary. These can largely be categorised as open-source models with usage limitations enforced by licensing, partially open-source (open-weights) models, and hybrid approaches in which obsolete model versions are open-sourced while competitive versions with market value remain proprietary. Currently, discussion of where on this spectrum future models should fall remains largely unbacked by evidence and opinion-driven, with industry leaders dominating the debate. In this paper, we present a data-driven approach, compiling data on the open-source development of LLMs and their contributions in terms of improvements, modifications, and methods. Our goal is not to support either extreme but to present data that can inform future discussion by industry experts as well as policy makers. Our findings indicate that open-source contributions can enhance model performance, with trends such as reduced model size and manageable accuracy loss. We also identify positive community engagement patterns and the architectures that benefit most from open contributions.
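The abstract sketches a spectrum of release strategies between fully proprietary and fully open. One way to make that taxonomy concrete is as an ordered enumeration; the category names and ordering below are our illustrative encoding, not a formal taxonomy from the paper.

```python
from enum import Enum

# Illustrative encoding of the openness spectrum described in the abstract;
# the names and ordering are assumptions, not the paper's formal taxonomy.
class Openness(Enum):
    PROPRIETARY = 0
    HYBRID_LEGACY_OPEN = 1   # obsolete versions open, competitive ones closed
    OPEN_WEIGHTS = 2         # weights released, training data/code withheld
    LICENSE_RESTRICTED = 3   # open source, but usage limited by licensing
    FULLY_OPEN = 4

def more_open(a: Openness, b: Openness) -> bool:
    # Orders two release strategies along the spectrum.
    return a.value > b.value

print(more_open(Openness.OPEN_WEIGHTS, Openness.PROPRIETARY))  # True
```

Treating openness as ordinal rather than binary is what lets the paper's question ("where on the spectrum should future models fall?") be posed empirically instead of as an either/or debate.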
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Open-source vs Proprietary
Privacy and Transparency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-driven Analysis
Open-source Development in AI
Community Engagement and Model Performance
Domen Vake
DIST, UP FAMNIT, Glagoljaška 8, 6000 Koper, Slovenia; IP, InnoRenew CoE, Livade 6a, 6310 Izola, Slovenia
Bogdan Šinik
DIST, UP FAMNIT, Glagoljaška 8, 6000 Koper, Slovenia
Jernej Vičič
University of Primorska, FAMNIT and Research Centre of the Slovenian Academy of Sciences and Arts
Machine translation, Language technologies, DLT, blockchain
Aleksandar Tošić
DIST, UP FAMNIT, Glagoljaška 8, 6000 Koper, Slovenia; IP, InnoRenew CoE, Livade 6a, 6310 Izola, Slovenia