An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science

📅 2025-02-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit poor code reproducibility in data science tasks, and this reproducibility has lacked systematic evaluation. Method: We propose an Analyst-Inspector dual-role framework, the first LLM evaluation paradigm explicitly designed for computational reproducibility. We formally define and quantify workflow sufficiency and completeness for reproducing functionally equivalent code; design two novel reproducibility-enhancing prompting strategies; and develop the first principle-driven, automated evaluation framework integrating rule-based validation and program analysis across three datasets and 1,032 tasks. Contribution/Results: Evaluating five state-of-the-art models, we find a strong correlation between reproducibility and accuracy; our prompting strategies significantly improve average reproducibility rates; and all code is publicly released.
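
The dual-role setup can be read as a simple loop: an analyst model solves the task and documents its workflow, and an inspector model then tries to regenerate functionally equivalent code from that documentation alone. The sketch below is a minimal illustration of this idea under assumed interfaces; `call_llm`, the prompt wording, and the equivalence check are placeholders, not the paper's implementation.

```python
# Minimal sketch of the analyst-inspector idea. call_llm() and
# functionally_equivalent() are hypothetical placeholders, not the
# paper's actual implementation.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a chat-completion call with a role-specific system prompt."""
    raise NotImplementedError

def analyst(task: str) -> tuple[str, str]:
    """Analyst role: solve the task, returning (code, workflow documentation)."""
    code = call_llm("analyst", f"Write Python code for this data analysis task:\n{task}")
    workflow = call_llm(
        "analyst",
        "Document every step, assumption, and parameter of your analysis so that "
        f"someone who has never seen your code could reproduce it:\n{code}",
    )
    return code, workflow

def inspector(task: str, workflow: str) -> str:
    """Inspector role: reimplement the analysis from the workflow documentation alone."""
    return call_llm(
        "inspector",
        f"Using only this workflow description, write Python code for the task.\n"
        f"Task: {task}\nWorkflow:\n{workflow}",
    )

def is_reproducible(task: str, functionally_equivalent) -> bool:
    """A workflow is sufficient and complete if the inspector's code is
    functionally equivalent to the analyst's original code."""
    code, workflow = analyst(task)
    reproduced = inspector(task, workflow)
    return functionally_equivalent(code, reproduced)
```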

📝 Abstract
Large Language Models (LLMs) have demonstrated potential for data science tasks via code generation. However, the exploratory nature of data science, alongside the stochastic and opaque outputs of LLMs, raises concerns about their reliability. While prior work focuses on benchmarking LLM accuracy, reproducibility remains underexplored, despite being critical to establishing trust in LLM-driven analysis. We propose a novel analyst-inspector framework to automatically evaluate and enforce the reproducibility of LLM-generated data science workflows, the first rigorous approach of its kind to the best of our knowledge. Defining reproducibility as the sufficiency and completeness of workflows for reproducing functionally equivalent code, this framework enforces computational reproducibility principles, ensuring transparent, well-documented LLM workflows while minimizing reliance on implicit model assumptions. Using this framework, we systematically evaluate five state-of-the-art LLMs on 1,032 data analysis tasks across three diverse benchmark datasets. We also introduce two novel reproducibility-enhancing prompting strategies. Our results show that higher reproducibility strongly correlates with improved accuracy and that reproducibility-enhancing prompts are effective, demonstrating the potential of structured prompting to enhance automated data science workflows and enable transparent, robust AI-driven analysis. Our code is publicly available.
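
The definition above turns on "functionally equivalent code": two programs count as equivalent when they produce the same result on the task's data, regardless of surface differences. The sketch below illustrates one way such a check, and the aggregate reproducibility rate over a benchmark, could look; the execution convention (a `result` variable), the sandboxing caveat, and the numeric tolerance are assumptions for illustration, not details from the paper.

```python
# Hedged sketch of a functional-equivalence check via execution, plus an
# aggregate reproducibility rate. Sandboxing, the `result` convention, and
# the numeric tolerance are illustrative assumptions.
import math

def run_analysis(code: str, data_path: str):
    """Execute generated code against a dataset and return the variable `result`.
    A real harness would sandbox this; exec() is used only for illustration."""
    namespace = {"DATA_PATH": data_path}
    exec(code, namespace)  # never exec untrusted code outside a sandbox
    return namespace.get("result")

def functionally_equivalent(code_a: str, code_b: str, data_path: str, tol: float = 1e-6) -> bool:
    """Two programs count as equivalent if they yield the same result on the same data."""
    a = run_analysis(code_a, data_path)
    b = run_analysis(code_b, data_path)
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=tol)
    return a == b

def reproducibility_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks whose workflow let the inspector reproduce equivalent code."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```
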
Problem

Research questions and friction points this paper is trying to address.

Evaluate the reproducibility of LLM-generated data science workflows
Enhance transparency in LLM-driven analysis workflows
Introduce reproducibility-enhancing prompting strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyst-Inspector framework for automated reproducibility evaluation
Enforcement of computational reproducibility principles in LLM workflows
Two novel reproducibility-enhancing prompting strategies (illustrated by the hypothetical prompt sketch below)
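
The summary does not spell out the two prompting strategies, so the snippet below is only a hypothetical illustration of a reproducibility-enhancing prompt: it asks the model to make every assumption, parameter, and step explicit rather than leaving them implicit in the code. In the framework's terms, such a prompt pushes the analyst toward workflows that are sufficient and complete for the inspector to reproduce.

```python
# Hypothetical reproducibility-enhancing prompt template; the two strategies
# evaluated in the paper may differ from this illustration.
REPRODUCIBILITY_PROMPT = """\
Solve the following data analysis task. Along with the code:
1. List every assumption you make about the data (column names, types, missing values).
2. State all parameter choices explicitly (random seeds, thresholds, library versions).
3. Describe each step of the workflow in plain language, in execution order, with
   enough detail that another analyst could rewrite the code without seeing it.

Task: {task}
"""

def build_prompt(task: str) -> str:
    """Fill the template for a single task."""
    return REPRODUCIBILITY_PROMPT.format(task=task)
```
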
Authors
Qiuhai Zeng, Pennsylvania State University
Claire Jin, Carnegie Mellon University (multimodal LLM, autonomous agents, NLP for perception & control, multimodal machine learning)
Xinyue Wang, Pennsylvania State University
Yuhan Zheng, International Monetary Fund
Qunhua Li, Pennsylvania State University