SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation

📅 2025-07-11

🤖 AI Summary

Constructing high-quality labeled datasets for software engineering (SE) is costly and scales poorly. To address this, we propose SPICE, an automated labeling pipeline built on large language models (LLMs). It combines context-aware code navigation, chain-of-thought prompting, and multi-round consensus voting to produce expert-level, fine-grained annotations of issue clarity, test coverage, and effort estimation. Evaluated on the SWE-bench Verified benchmark, SPICE achieves strong agreement with human experts (Cohen’s κ > 0.85). Its per-sample labeling cost drops from about $100 to $0.0051, a reduction of roughly 99.995%. Furthermore, we release SPICE Bench, a publicly available dataset of 6,802 annotated instances. This resource provides a reproducible and scalable data infrastructure for training and evaluating SE foundation models.
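
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Cohen's κ can be computed between a human annotator and an automated labeler. It is not taken from the SPICE code base; the function name `cohens_kappa` and the toy label lists are assumptions made purely for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Hypothetical helper for illustration; not part of the SPICE tool.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two annotators'
    # marginal frequencies, summed over categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example (made-up labels, not data from the paper).
human = [0, 0, 1, 1]
model = [0, 0, 1, 0]
kappa = cohens_kappa(human, model)
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance.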

📝 Abstract

High-quality labeled datasets are crucial for training and evaluating foundation models in software engineering, but creating them is often prohibitively expensive and labor-intensive. We introduce SPICE, a scalable, automated pipeline for labeling SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation. SPICE combines context-aware code navigation, rationale-driven prompting, and multi-pass consensus to produce labels that closely approximate expert annotations. SPICE's design was informed by our own experience and frustration in labeling more than 800 instances from SWE-Gym. SPICE achieves strong agreement with human-labeled SWE-bench Verified data while reducing the cost of labeling 1,000 instances from around $100,000 (manual annotation) to just $5.10. These results demonstrate SPICE's potential to enable cost-effective, large-scale dataset creation for SE-focused FMs. To support the community, we release both the SPICE tool and SPICE Bench, a new dataset of 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym (over 13x larger than SWE-bench Verified).
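
The abstract does not spell out how the multi-pass consensus step works. One common way to realize the idea is a majority vote over the labels produced by several independent passes, sketched below. The function name `consensus_label`, the integer label scale, and the tie-breaking rule are all assumptions for illustration and may differ from what SPICE actually does.

```python
from collections import Counter


def consensus_label(pass_labels):
    """Return the majority label across several labeling passes.

    Hypothetical sketch of a consensus step; not the SPICE implementation.
    Ties are broken by choosing the smallest label, which is an arbitrary
    assumption made here so that the result is deterministic.
    """
    if not pass_labels:
        raise ValueError("at least one pass is required")

    counts = Counter(pass_labels)
    best_count = max(counts.values())
    # Collect every label that reached the top vote count, then break ties.
    winners = [label for label, count in counts.items() if count == best_count]
    return min(winners)


# Toy example: three passes over one instance, each returning an
# issue-clarity score on an assumed integer scale.
final = consensus_label([1, 1, 2])
```

Running several passes and voting trades extra model calls for stability, since a single stochastic LLM response can land on an outlier label.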

Problem

Research questions and friction points this paper is trying to address.

Automates labeling for issue clarity, test coverage, and effort estimation

Reduces high costs of manual dataset annotation

Enables scalable creation of SE-focused foundation model datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for labeling datasets

Combines context-aware code navigation, rationale-driven prompting, and multi-pass consensus

Reduces the cost of labeling 1,000 instances from around $100,000 to $5.10

🔎 Similar Papers

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark