An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science

📅 2025-02-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit poor code reproducibility in data science tasks, and this reproducibility has lacked systematic evaluation. Method: We propose an Analyst-Inspector dual-role framework, the first LLM evaluation paradigm explicitly designed for computational reproducibility. We formally define and quantify workflow sufficiency and completeness for reproducing functionally equivalent code; design two novel reproducibility-enhancing prompting strategies; and develop the first principle-driven, automated evaluation framework integrating rule-based validation and program analysis across three datasets and 1,032 tasks. Contribution/Results: Evaluating five state-of-the-art models, we find a strong correlation between reproducibility and accuracy; our prompting strategies significantly improve average reproducibility rates; and all code is publicly released.
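
The dual-role setup can be read as a simple loop: an analyst model solves the task and documents its workflow, and an inspector model then tries to regenerate functionally equivalent code from that documentation alone. The sketch below is a minimal illustration of this idea under assumed interfaces; `call_llm`, the prompt wording, and the equivalence check are placeholders, not the paper's implementation.

```python
# Minimal sketch of the analyst-inspector idea. call_llm() and
# functionally_equivalent() are hypothetical placeholders, not the
# paper's actual implementation.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a chat-completion call with a role-specific system prompt."""
    raise NotImplementedError

def analyst(task: str) -> tuple[str, str]:
    """Analyst role: solve the task, returning (code, workflow documentation)."""
    code = call_llm("analyst", f"Write Python code for this data analysis task:\n{task}")
    workflow = call_llm(
        "analyst",
        "Document every step, assumption, and parameter of your analysis so that "
        f"someone who has never seen your code could reproduce it:\n{code}",
    )
    return code, workflow

def inspector(task: str, workflow: str) -> str:
    """Inspector role: reimplement the analysis from the workflow documentation alone."""
    return call_llm(
        "inspector",
        f"Using only this workflow description, write Python code for the task.\n"
        f"Task: {task}\nWorkflow:\n{workflow}",
    )

def is_reproducible(task: str, functionally_equivalent) -> bool:
    """A workflow is sufficient and complete if the inspector's code is
    functionally equivalent to the analyst's original code."""
    code, workflow = analyst(task)
    reproduced = inspector(task, workflow)
    return functionally_equivalent(code, reproduced)
```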

📝 Abstract
Large Language Models (LLMs) have demonstrated potential for data science tasks via code generation. However, the exploratory nature of data science, alongside the stochastic and opaque outputs of LLMs, raises concerns about their reliability. While prior work focuses on benchmarking LLM accuracy, reproducibility remains underexplored, despite being critical to establishing trust in LLM-driven analysis. We propose a novel analyst-inspector framework to automatically evaluate and enforce the reproducibility of LLM-generated data science workflows, the first rigorous approach of its kind to the best of our knowledge. Defining reproducibility as the sufficiency and completeness of workflows for reproducing functionally equivalent code, this framework enforces computational reproducibility principles, ensuring transparent, well-documented LLM workflows while minimizing reliance on implicit model assumptions. Using this framework, we systematically evaluate five state-of-the-art LLMs on 1,032 data analysis tasks across three diverse benchmark datasets. We also introduce two novel reproducibility-enhancing prompting strategies. Our results show that higher reproducibility strongly correlates with improved accuracy and that reproducibility-enhancing prompts are effective, demonstrating the potential of structured prompting to enhance automated data science workflows and enable transparent, robust AI-driven analysis. Our code is publicly available.
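
The definition above turns on "functionally equivalent code": two programs count as equivalent when they produce the same result on the task's data, regardless of surface differences. The sketch below illustrates one way such a check, and the aggregate reproducibility rate over a benchmark, could look; the execution convention (a `result` variable), the sandboxing caveat, and the numeric tolerance are assumptions for illustration, not details from the paper.

```python
# Hedged sketch of a functional-equivalence check via execution, plus an
# aggregate reproducibility rate. Sandboxing, the `result` convention, and
# the numeric tolerance are illustrative assumptions.
import math

def run_analysis(code: str, data_path: str):
    """Execute generated code against a dataset and return the variable `result`.
    A real harness would sandbox this; exec() is used only for illustration."""
    namespace = {"DATA_PATH": data_path}
    exec(code, namespace)  # never exec untrusted code outside a sandbox
    return namespace.get("result")

def functionally_equivalent(code_a: str, code_b: str, data_path: str, tol: float = 1e-6) -> bool:
    """Two programs count as equivalent if they yield the same result on the same data."""
    a = run_analysis(code_a, data_path)
    b = run_analysis(code_b, data_path)
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=tol)
    return a == b

def reproducibility_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks whose workflow let the inspector reproduce equivalent code."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```
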
Problem

Research questions and friction points this paper is trying to address.

Evaluate the reproducibility of LLM-generated data science workflows
Enhance transparency in LLM-driven analysis workflows
Introduce reproducibility-enhancing prompting strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyst-Inspector framework for automated reproducibility evaluation
Enforcement of computational reproducibility principles in LLM workflows
Two novel reproducibility-enhancing prompting strategies (illustrated by the hypothetical prompt sketch below)
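
The summary does not spell out the two prompting strategies, so the snippet below is only a hypothetical illustration of a reproducibility-enhancing prompt: it asks the model to make every assumption, parameter, and step explicit rather than leaving them implicit in the code. In the framework's terms, such a prompt pushes the analyst toward workflows that are sufficient and complete for the inspector to reproduce.

```python
# Hypothetical reproducibility-enhancing prompt template; the two strategies
# evaluated in the paper may differ from this illustration.
REPRODUCIBILITY_PROMPT = """\
Solve the following data analysis task. Along with the code:
1. List every assumption you make about the data (column names, types, missing values).
2. State all parameter choices explicitly (random seeds, thresholds, library versions).
3. Describe each step of the workflow in plain language, in execution order, with
   enough detail that another analyst could rewrite the code without seeing it.

Task: {task}
"""

def build_prompt(task: str) -> str:
    """Fill the template for a single task."""
    return REPRODUCIBILITY_PROMPT.format(task=task)
```
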
Authors
Qiuhai Zeng, Pennsylvania State University
Claire Jin, Carnegie Mellon University (multimodal LLM, autonomous agents, NLP for perception & control, multimodal machine learning)
Xinyue Wang, Pennsylvania State University
Yuhan Zheng, International Monetary Fund
Qunhua Li, Pennsylvania State University