Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets

📅 2026-05-14
🤖 AI Summary
This work addresses a limitation of existing Croissant metadata generation workflows, which typically begin with uploading data to a public platform and therefore cannot accommodate governed or large local datasets. The authors release Croissant Baker, an open-source, local-first command-line tool that generates validated Croissant JSON-LD metadata directly from a local dataset directory through a modular handler registry, with support for formats such as Parquet. Because it does not depend on an external platform, the tool can make private, high-value datasets easier to discover and reuse. In an evaluation on more than 140 datasets, including MIMIC-IV with 886 million rows across 374 Parquet files, the generated metadata reaches 97–100% agreement with producer-authored or standards-derived ground truth on held-out comparisons.
📝 Abstract
Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.
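To make the "modular handler registry" idea concrete, here is a minimal Python sketch of how a local-first generator could map file types to handlers and assemble a Croissant-like JSON-LD document from a directory. It is an illustration under stated assumptions, not Croissant Baker's actual code or API: the names `HANDLERS`, `register`, `describe_csv`, and `bake` are hypothetical, the output is a simplified subset of Croissant that omits the JSON-LD `@context` and would not pass a validator, and only CSV is handled so the example stays within the standard library. A Parquet handler would presumably be registered the same way.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: file suffix -> function that describes one file.
HANDLERS: Dict[str, Callable[[Path], dict]] = {}


def register(suffix: str):
    """Decorator that registers a handler for a given file suffix."""
    def wrap(fn: Callable[[Path], dict]) -> Callable[[Path], dict]:
        HANDLERS[suffix] = fn
        return fn
    return wrap


@register(".csv")
def describe_csv(path: Path) -> dict:
    # Read only the header row; the data rows are never loaded.
    with path.open(newline="") as f:
        header = next(csv.reader(f), [])
    return {"encodingFormat": "text/csv", "columns": header}


def bake(root: Path) -> dict:
    """Walk a directory and build a simplified, Croissant-like dict."""
    distribution, record_sets = [], []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        handler = HANDLERS.get(path.suffix.lower())
        if handler is None:
            continue  # no handler registered for this format; skip the file
        info = handler(path)
        rel = path.relative_to(root).as_posix()
        distribution.append({
            "@type": "cr:FileObject",
            "@id": rel,
            "name": rel,
            "contentUrl": rel,
            "encodingFormat": info["encodingFormat"],
        })
        record_sets.append({
            "@type": "cr:RecordSet",
            "name": path.stem,
            "field": [{"@type": "cr:Field", "name": c} for c in info["columns"]],
        })
    return {
        "@type": "sc:Dataset",
        "name": root.name,
        "distribution": distribution,
        "recordSet": record_sets,
    }


if __name__ == "__main__":
    # Demo on a throwaway directory with one made-up CSV file.
    with tempfile.TemporaryDirectory() as d:
        (Path(d) / "admissions.csv").write_text("subject_id,admit_time\n1,2020-01-01\n")
        print(json.dumps(bake(Path(d)), indent=2))
```

The point of the registry in this sketch is that supporting a new format only requires adding one decorated function; the directory walk and the document assembly stay unchanged.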
Problem

Research questions and friction points this paper is trying to address.

Croissant
metadata generation
machine learning datasets
local repositories
data governance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Croissant metadata
local-first tool
modular handler registry
dataset discoverability
machine-checkable metadata
Rafi Al Attrach
Technical University of Munich
Rajna Fani
TUM, MIT
Natural Language Processing, Machine Learning, Healthcare AI, Bias
Sebastian Lobentanzer
Helmholtz Munich
Joan Giner-Miguelez
Barcelona Supercomputing Center
Debanshu Das
Google
Varuni H. K.
Couchbase
Nobin Sarwar
University of Maryland, Baltimore County
Rajat Ghosh
Nutanix
Anwai Archit
PhD Candidate, University of Göttingen
Biomedical Image Analysis, Machine Learning, Computer Vision
Surbhi Motghare
Salesforce
Christina Conrad Parry
Sage Bionetworks
Luis Oala
Founder and Chief AI Officer at Brickroad
Machine Learning
Lara Grosso
Harvard University
Joaquin Vanschoren
Eindhoven University of Technology; Google Deepmind (Visiting)
Artificial Intelligence, Machine Learning
Steffen Vogler
BAYER - Radiology R&D
AI, machine learning, Medical AI, image analysis, computational life science
Sujata Goswami
Independent Researcher
Eric S. Rosenthal
Massachusetts General Hospital
Marzyeh Ghassemi
Massachusetts Institute of Technology
Matthew McDermott
Columbia University
Tom Pollard
Massachusetts Institute of Technology
machine learning for health