Artisan: Agentic Artifact Evaluation

📅 2026-02-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high cost, labor intensity, and poor reproducibility inherent in the manual evaluation of software engineering artifacts. To this end, we propose Artisan, an LLM-powered agent that, for the first time, formalizes research reproduction as a standalone code generation task. Artisan incorporates a novel automated judging mechanism that deliberately withholds ground-truth results, guiding the agent toward accurate reproductions without leakage. We also introduce Artisan-Bench, the first benchmark designed for automated artifact evaluation in software engineering. Experimental results show that Artisan generates correct reproduction scripts for 44 of 60 tasks, 3.14× more than the best baseline, at an average cost of $0.45 and 48 minutes per task, and uncovers 20 previously unknown errors in published papers or their associated artifacts.

📝 Abstract
Artifact evaluation has become standard practice in the software engineering community to ensure the reproducibility of research results. However, the current manual process is labor-intensive and, hence, done only as a one-time assessment for a subset of all papers. To support the artifact evaluation effort, we present Artisan, an automated LLM agent for reproducing research results given a paper and its artifact. The approach is enabled by two key contributions: First, we frame the reproduction problem as a code generation task where the goal is to generate a reproduction script that, when executed, reproduces the results reported in a paper. Unlike prior work on automatically reproducing research results in other domains, this formulation allows for running the script independently of the agent and for assessing the reproduction process at a fine-grained level. Second, we design an automated judging mechanism that guides the agent toward the expected results without revealing them and that prevents trivial solutions, such as simply copying checked-in results. To evaluate Artisan, we introduce Artisan-Bench, the first benchmark assessing the ability to generate reproduction scripts and the first benchmark for automated artifact evaluation in software engineering. Artisan-Bench comprises 60 tasks derived from 23 software engineering papers, covering different research areas and programming languages. We validate all tasks in Artisan-Bench for reproducibility to ensure that the tasks are feasible. Our experiments show that Artisan is effective, producing 44/60 reproduction scripts and outperforming the best available baseline, a vanilla LLM agent (mini-swe-agent), by 3.14$\times$ in terms of reproduction scripts generated, while taking, on average, $0.45 and 48 minutes per task. Artisan also helped uncover 20 new errors in either the paper or artifact.
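The judging idea in the abstract, guiding the agent toward expected results without revealing them, can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, metrics, and tolerance are hypothetical. The judge holds the ground-truth values privately and returns only coarse feedback (match, or a direction hint), so the agent can iterate without ever seeing the withheld numbers:

```python
# Hypothetical sketch of a leakage-free judging mechanism: the judge
# closes over withheld ground-truth metrics and returns only coarse
# feedback, never the expected values themselves.

def make_judge(ground_truth: dict, tolerance: float = 0.01):
    """Return a judge closure over withheld ground-truth metrics."""
    def judge(reproduced: dict) -> dict:
        feedback = {}
        for metric, expected in ground_truth.items():
            if metric not in reproduced:
                feedback[metric] = "missing"
            elif abs(reproduced[metric] - expected) <= tolerance * abs(expected):
                feedback[metric] = "match"
            else:
                # A direction hint guides the agent without revealing the value.
                feedback[metric] = (
                    "too low" if reproduced[metric] < expected else "too high"
                )
        return feedback
    return judge

# Example: the agent's reproduction run is scored against withheld metrics.
judge = make_judge({"accuracy": 0.91, "f1": 0.88})
print(judge({"accuracy": 0.90, "f1": 0.95}))
```

Because the feedback is coarse, trivially copying checked-in result files would still have to pass the same comparison against values the agent never observes, which is the property the abstract's anti-leakage design aims for.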
Problem

Research questions and friction points this paper is trying to address.

artifact evaluation
reproducibility
software engineering
research reproduction
manual assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated artifact evaluation
LLM agent
reproduction script generation
code generation for reproducibility
Artisan-Bench