🤖 AI Summary
Constructing high-quality labeled datasets for software engineering (SE) is costly and scales poorly. To address this, we propose SPICE, an automated labeling pipeline built on large language models (LLMs) that integrates context-aware code navigation, chain-of-thought prompting, and multi-round consensus voting to produce expert-level, fine-grained annotations for issue clarity, test coverage, and effort estimation. Evaluated on the SWE-bench Verified benchmark, SPICE achieves strong agreement with human expert labels (Cohen’s κ > 0.85). Its per-sample labeling cost drops from $100 to $0.0051, a reduction of about 99.995%. Furthermore, we release SPICE Bench, a publicly available dataset comprising 6,802 annotated instances. This resource provides a reproducible and scalable data infrastructure for training and evaluating SE foundation models.
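The per-sample figures can be checked against the per-1,000-instance costs quoted in the abstract. The short calculation below is only a sanity check on that arithmetic; the variable names are illustrative and not taken from the SPICE tool.

```python
# Costs for labeling 1,000 instances, as stated in the abstract (USD).
manual_cost_per_1000 = 100_000.00  # manual annotation (approximate)
spice_cost_per_1000 = 5.10         # SPICE

# Per-instance costs.
manual_per_instance = manual_cost_per_1000 / 1000
spice_per_instance = spice_cost_per_1000 / 1000

# Relative cost reduction, in percent.
reduction_pct = (1 - spice_cost_per_1000 / manual_cost_per_1000) * 100

print(f"manual: ${manual_per_instance:.2f} per instance")
print(f"SPICE:  ${spice_per_instance:.4f} per instance")
print(f"reduction: {reduction_pct:.3f}%")
```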
📝 Abstract
High-quality labeled datasets are crucial for training and evaluating foundation models (FMs) in software engineering (SE), but creating them is often prohibitively expensive and labor-intensive. We introduce SPICE, a scalable, automated pipeline for labeling SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation. SPICE combines context-aware code navigation, rationale-driven prompting, and multi-pass consensus to produce labels that closely approximate expert annotations. SPICE's design was informed by our own experience of, and frustration with, labeling more than 800 instances from SWE-Gym. SPICE achieves strong agreement with human-labeled SWE-bench Verified data while reducing the cost of labeling 1,000 instances from around $100,000 (manual annotation) to just $5.10. These results demonstrate SPICE's potential to enable cost-effective, large-scale dataset creation for SE-focused FMs. To support the community, we release both the SPICE tool and SPICE Bench, a new dataset of 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym (over 13x larger than SWE-bench Verified).
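To make the idea of multi-pass consensus concrete, here is a minimal sketch that assumes consensus means a strict majority vote over several independent labeling passes. The abstract does not specify SPICE's actual aggregation rule, so the function `consensus_label`, its parameters, and the canned pass outputs are all hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Optional


def consensus_label(
    run_pass: Callable[[], Hashable],
    n_passes: int = 3,
    min_share: float = 0.5,
) -> Optional[Hashable]:
    """Run a labeling pass n_passes times and return the majority label.

    Returns None when no label is chosen in more than min_share of the
    passes. This is an assumed rule, not SPICE's documented one.
    """
    if n_passes < 1:
        raise ValueError("n_passes must be at least 1")
    votes = Counter(run_pass() for _ in range(n_passes))
    label, count = votes.most_common(1)[0]
    return label if count / n_passes > min_share else None


# Stand-in for an LLM labeling call: replay canned outputs for one instance.
canned_outputs = iter([1, 0, 1])
agreed = consensus_label(lambda: next(canned_outputs), n_passes=3)
print(agreed)
```

Returning `None` when no label wins a strict majority is one possible design choice; it lets ambiguous instances be flagged for another pass or for human review rather than being labeled arbitrarily.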