🤖 AI Summary
This study addresses the longstanding lack of systematic reproducibility evaluation in evolutionary computation research. The authors pair a structured reproducibility checklist, applied through manual assessment, with RECAP, an automated evaluation pipeline powered by large language models (LLMs), to conduct a large-scale empirical analysis of papers from the Evolutionary Combinatorial Optimization and Metaheuristics track of the GECCO conference over a ten-year period. Their findings reveal an average reproducibility completeness score of 0.62, with only 36.90% of papers providing supplementary material beyond the manuscript itself. RECAP demonstrates substantial agreement with human assessors (Cohen's κ = 0.67), validating the feasibility and effectiveness of LLM-driven automated reproducibility evaluation. This work establishes a scalable paradigm for reproducibility assessment in the field.
📝 Abstract
Reproducibility is an important requirement in evolutionary computation, where results largely depend on computational experiments. In practice, reproducibility relies on how algorithms, experimental protocols, and artifacts are documented and shared. Despite growing awareness, there is still limited empirical evidence on the actual reproducibility levels of published work in the field. In this paper, we study the reproducibility practices in papers published in the Evolutionary Combinatorial Optimization and Metaheuristics track of the Genetic and Evolutionary Computation Conference over a ten-year period. We introduce a structured reproducibility checklist and apply it through a systematic manual assessment of the selected corpus. In addition, we propose RECAP (REproducibility Checklist Automation Pipeline), an LLM-based system that automatically evaluates reproducibility signals from paper text and associated code repositories. Our analysis shows that papers achieve an average completeness score of 0.62, and that 36.90% of them provide additional material beyond the manuscript itself. We demonstrate that automated assessment is feasible: RECAP achieves substantial agreement with human evaluators (Cohen's κ of 0.67). Together, these results highlight persistent gaps in reproducibility reporting and suggest that automated tools can effectively support large-scale, systematic monitoring of reproducibility practices.
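The agreement statistic reported above, Cohen's κ, corrects raw agreement between two raters for the agreement expected by chance. A minimal sketch of how such a comparison between a human assessor and an automated pipeline could be computed (the labels below are hypothetical yes/no checklist judgments, not data from the paper):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' labels over the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # observed proportion of items on which the raters agree
    po = sum(x == y for x, y in zip(a, b)) / n
    # expected agreement if the raters labeled independently,
    # keeping each rater's marginal label frequencies
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# hypothetical checklist verdicts: human assessor vs. automated pipeline
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
auto  = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(round(cohens_kappa(human, auto), 2))
```

Values of κ between 0.61 and 0.80 are conventionally read as "substantial agreement" (Landis and Koch), which is the band the reported 0.67 falls into.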