Towards Best Practices for Open Datasets for LLM Training

📅 2025-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This position paper examines why no large language model (LLM) has yet been trained at meaningful scale on openly licensed and public domain data, and what it would take to change that. It traces the obstacles to a jurisdiction-dependent and often ambiguous copyright landscape, a litigation-driven trend toward disclosing less about training datasets, incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the breadth of legal and technical expertise the task requires. Key contributions include: (1) an analysis of how shrinking dataset transparency harms researchers, auditors, and impacted individuals; (2) a set of best practices for responsibly curating and governing open LLM training datasets; and (3) a call for collaboration across legal, technical, and policy domains, backed by investment in metadata standards, digitization, and a culture of openness.
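
Where the paper calls for investment in metadata standards, the underlying idea is that every document in a training corpus should carry machine-readable licensing and provenance information. The sketch below shows one plausible shape for such a record, using the Schema.org "Dataset" vocabulary serialized as JSON-LD; the vocabulary choice, field names, and values are illustrative assumptions, not a schema published by the paper.

```python
# Minimal sketch (not the paper's schema): a Schema.org-style Dataset record
# carrying license and provenance metadata, serialized as JSON-LD.
import json

record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "example-public-domain-books",  # hypothetical dataset name
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",  # CC0
    "isBasedOn": "https://example.org/source-scans",  # provenance: upstream source
    "creator": {"@type": "Organization", "name": "Example Digitization Lab"},
    "dateModified": "2025-01-14",
}

print(json.dumps(record, indent=2))
```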

📝 Abstract
Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in some jurisdictions, such as the EU and Japan, this is allowed under certain restrictions, while in the United States the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend in limiting data information causes harm by hindering transparency, accountability, and innovation in the broader ecosystem, denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.
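
To make the "openly licensed data" criterion concrete, one common approach (assumed here for illustration, not taken from the paper) is to filter a corpus against an allow-list of SPDX license identifiers, excluding anything whose license metadata is missing or unknown:

```python
# Minimal license-filtering sketch. The document shape, field names, and
# allow-list are illustrative assumptions, not the paper's pipeline.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}  # SPDX identifiers

documents = [
    {"id": "doc-1", "license": "CC0-1.0", "text": "..."},
    {"id": "doc-2", "license": "proprietary", "text": "..."},
    {"id": "doc-3", "license": None, "text": "..."},  # missing metadata: excluded
]

def openly_licensed(doc):
    """Keep only documents whose license is known and on the allow-list."""
    return doc.get("license") in OPEN_LICENSES

corpus = [doc for doc in documents if openly_licensed(doc)]
print([doc["id"] for doc in corpus])  # -> ['doc-1']
```

Note that documents with missing license metadata are dropped, which is precisely why the incomplete and unreliable metadata described in the abstract shrinks the usable open corpus.
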
Problem

Research questions and friction points this paper is trying to address:
- Copyright Law
- Transparent AI Research
- High-Quality Dataset
- Innovation

Highlights

Methods, ideas, or system contributions that make the work stand out:
- Cross-domain Collaboration
- Data Standardization
- AI Transparency

Authors

Project leads
- Stefan Baack
- Stella Biderman (EleutherAI; Natural Language Processing, Artificial Intelligence, Language Modeling, Deep Learning)
- Kasia Odrozek
- Aviya Skowron

Top contributors
- Ayah Bdeir
- Jillian Bommarito
- Jennifer Ding (The Alan Turing Institute)
- Maximilian Gahntz
- Paul Keller
- Pierre-Carl Langlais
- Greg Lindahl
- Sebastian Majstorovic
- Nik Marda
- Guilherme Penedo (ML Research Engineer at 🤗 Hugging Face)
- Maarten Van Segbroeck (Gretel; Machine Learning, Deep Learning, Artificial Intelligence, Signal Processing, Speech Recognition)
- Jennifer Wang
- Leandro von Werra (Hugging Face)

Contributors
- Mitchell Baker
- Julie Beliao
- Kasia Chmielinski
- Marzieh Fadaee (Staff Research Scientist, Cohere Labs; Computational Linguistics, Machine Learning, Natural Language Processing, Multilingual NLP)
- Lisa Gutermuth
- Hynek Kydlíček
- Greg Leppert
- EM Lewis-Jong
- Solana Larsen
- Shayne Longpre (MIT, Stanford, Apple; Deep Learning, Natural Language Understanding)
- Angela Oduor Lungati
- Cullen Miller
- Victor Miller
- Max Ryabinin (Together AI; Deep Learning, Natural Language Processing, Distributed Training)
- Kathleen Siminyu
- Andrew Strait
- Mark Surman
- Anna Tumadóttir
- Maurice Weber (Together AI; Large Language Models, Knowledge Distillation, Machine Learning)
- Rebecca Weiss
- Lee White
- Thomas Wolf (Co-founder at Hugging Face; Machine Learning, Deep Learning, Natural Language Processing, Computational Linguistics, Artificial Intelligence)