🤖 AI Summary
This work addresses the persistent challenge of reproducibility in machine learning research, which is often hindered by underspecified execution details and brittle software environments. The authors propose Croissant Tasks, a declarative, machine-actionable metadata format that decouples task specifications from concrete implementations by abstracting low-level details into high-level specifications. Building on this format, they develop an automated LLM pipeline that retrofits existing benchmarks into it, and they show that autonomous agents can ingest the resulting specifications to generate functional, accurate reproduction pipelines from scratch, without relying on the original implementations. The empirical evaluation supports conceptual reproducibility: claims are verified through independent, agent-generated implementations rather than by replicating the original source code.
📝 Abstract
Reproducibility is fundamental to the scientific method, yet it remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. To address this gap, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, which formally decouples a task's problem definition from its solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing that autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.
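
To make the idea of decoupling a task's problem definition from its solution concrete, the following is a minimal sketch in Python. It does not reproduce the Croissant Tasks schema: the `TaskSpec` fields, the `METRICS` registry, the `conceptually_reproduces` helper, and the tolerance-based acceptance rule are all hypothetical names and assumptions chosen for illustration. The sketch only shows how a declarative description of a task could be checked against any independently written solution.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical declarative task description (NOT the Croissant Tasks schema)."""
    name: str
    dataset: str           # identifier of the evaluation data (illustrative)
    metric: str            # name of the metric the claim is stated in
    reported_value: float  # value claimed by the original work
    tolerance: float       # assumed acceptable absolute deviation


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Metric implementations live outside the spec; the spec refers to them by name.
METRICS: Dict[str, Callable[[Sequence[int], Sequence[int]], float]] = {
    "accuracy": accuracy,
}


def conceptually_reproduces(
    spec: TaskSpec,
    solution: Callable[[int], int],
    inputs: Sequence[int],
    labels: Sequence[int],
) -> bool:
    """Check whether an independent solution matches the reported value.

    The solution is any callable; nothing here depends on the original code.
    """
    predictions = [solution(x) for x in inputs]
    score = METRICS[spec.metric](predictions, labels)
    return abs(score - spec.reported_value) <= spec.tolerance


if __name__ == "__main__":
    # Toy task: predict the parity of an integer.
    spec = TaskSpec(
        name="parity",
        dataset="toy-integers",
        metric="accuracy",
        reported_value=1.0,
        tolerance=0.01,
    )
    inputs, labels = [1, 2, 3, 4], [1, 0, 1, 0]
    print(conceptually_reproduces(spec, lambda x: x % 2, inputs, labels))
```

In this sketch the specification carries only what is claimed (dataset, metric, reported value), while the solution is passed in separately. Under these assumptions, an agent-generated implementation could be swapped in for the lambda without touching the specification.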