SpecBench: Evaluating Specification-Level Reasoning for Software Engineering LLM Agents

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a gap in the evaluation of software engineering agents: existing evaluations focus mainly on code implementation and neglect the ability to detect and correct defects in requirements specifications, such as omissions, ambiguities, and inconsistencies. The authors propose the first evaluation framework centered on specification-level reasoning and construct a benchmark from the RFC (Request for Comments) processes of open-source projects. The framework requires agents to identify design flaws by synthesizing the initial proposal, the code repository, and historical discussions. Evaluations across five repositories, including Kubernetes and React, show that even the best-performing model (GPT-5.4) achieves only 44.4% accuracy, indicating that current agents remain limited in requirement analysis and design review when no execution feedback is available.
📝 Abstract
Software engineering (SWE) agents are transitioning from code generation to full software development lifecycle automation. A critical phase in this lifecycle is specification design: transforming initial proposals into carefully considered requirements through expert review. Existing benchmarks such as SWE-Bench are implementation-focused: they measure an agent's ability to generate code given fixed, precise design requirements. This formulation assumes that specifications are correct and complete. In real-world complex and critical software systems, initial specifications are often incomplete and flawed, and they require extensive expert review and revision before being accepted for implementation. To fill this gap, we introduce SpecBench to evaluate specification-level reasoning: the ability to generate complete, unambiguous, consistent, and correct system specifications. SpecBench tasks are derived from the Request for Comments (RFC) process used by mature open-source projects. For each task, an agent is given an initial design proposal, the project codebase, and all of the project's past RFC discussions. The agent is tasked with identifying specification deficiencies in the initial proposal: omissions, ambiguities, inconsistencies, or incorrect assumptions. We evaluate predictions against critiques raised by expert maintainers during historical RFC reviews. SpecBench contains tasks from 5 diverse repositories: Kubernetes, React, Rust, TVM, and vLLM. We evaluate state-of-the-art SWE agents on SpecBench, analyzing their capacity to reason about system design without execution feedback. The best-performing agent, GPT-5.4, achieves 44.4% accuracy.
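The task setup described in the abstract can be pictured with a small sketch. This is an illustration only, not SpecBench's actual data format or scoring procedure, neither of which is given here. All names (`SpecTask`, `Deficiency`, `score_task`, `same_kind`) are hypothetical, and the scoring rule (the fraction of expert critiques matched by at least one prediction) is an assumption. The matcher is a deliberately crude stand-in for whatever judgment the benchmark really uses.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class DeficiencyType(Enum):
    """The four deficiency categories named in the abstract."""
    OMISSION = "omission"
    AMBIGUITY = "ambiguity"
    INCONSISTENCY = "inconsistency"
    INCORRECT_ASSUMPTION = "incorrect_assumption"


@dataclass
class Deficiency:
    kind: DeficiencyType
    description: str


@dataclass
class SpecTask:
    """Hypothetical container for one task: the three inputs plus reference critiques."""
    repository: str
    proposal: str
    codebase_path: str
    past_rfc_discussions: List[str] = field(default_factory=list)
    expert_critiques: List[Deficiency] = field(default_factory=list)


def same_kind(pred: Deficiency, gold: Deficiency) -> bool:
    """Toy matcher (assumption): a prediction matches if the category agrees."""
    return pred.kind == gold.kind


def score_task(
    task: SpecTask,
    predictions: List[Deficiency],
    matches: Callable[[Deficiency, Deficiency], bool] = same_kind,
) -> float:
    """Assumed metric: share of expert critiques covered by some prediction."""
    if not task.expert_critiques:
        return 0.0
    covered = sum(
        1
        for gold in task.expert_critiques
        if any(matches(pred, gold) for pred in predictions)
    )
    return covered / len(task.expert_critiques)
```

In practice, comparing free-text critiques would need a far more careful matcher than a category check; the pluggable `matches` argument is there only to make that dependency explicit.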
Problem

Research questions and friction points this paper is trying to address.

specification-level reasoning
software engineering agents
requirements specification
RFC process
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

specification-level reasoning
software engineering agents
RFC-based benchmark
requirements validation
LLM evaluation