SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models can autonomously construct production-grade software from explicit specifications, focusing on their architectural reasoning in complex, long-horizon engineering tasks. To this end, the authors introduce SWE-AGI, the first high-complexity, specification-driven, end-to-end benchmark built within the MoonBit ecosystem, where the risk of training-data leakage is low. Agents must adhere strictly to RFCs and other authoritative standards while implementing modules such as parsers, interpreters, binary decoders, and SAT solvers under fixed API constraints. Experimental results show that GPT-5.3-Codex achieves the highest performance (86.4%), followed by Claude-Opus-4.6 (68.2%), with Kimi-2.5 leading among open-source models. Performance drops sharply as task difficulty increases, indicating that current models cannot yet reliably support production-level development, with code comprehension emerging as the primary bottleneck.

📝 Abstract
Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000–10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), while kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.
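To give a concrete sense of one task class the abstract mentions, the sketch below is a toy DPLL-style SAT solver in Python. It is purely illustrative and not part of the benchmark: SWE-AGI tasks are written in MoonBit against a fixed API scaffold and their solvers run to thousands of lines, whereas this sketch only shows the core decision procedure. The function names (`dpll`, `simplify`) and the DIMACS-style integer-literal clause encoding are assumptions made for the example.

```python
def simplify(clauses, lit):
    """Assume literal `lit` is true: drop satisfied clauses, shrink the rest.

    Returns the reduced clause list, or None if an empty clause
    (a conflict) is produced.
    """
    out = []
    for clause in clauses:
        if lit in clause:
            continue                      # clause satisfied, drop it
        reduced = [l for l in clause if l != -lit]
        if not reduced:
            return None                   # empty clause: conflict
        out.append(reduced)
    return out


def dpll(clauses):
    """Return True iff the CNF formula is satisfiable.

    `clauses` is a list of clauses; each clause is a list of nonzero
    ints, where a negative int is a negated variable (DIMACS style).
    """
    if not clauses:
        return True                       # no constraints left: satisfiable
    # Unit propagation: a one-literal clause forces an assignment.
    for clause in clauses:
        if len(clause) == 1:
            reduced = simplify(clauses, clause[0])
            return reduced is not None and dpll(reduced)
    # Branch on the first literal of the first clause, trying both polarities.
    lit = clauses[0][0]
    for choice in (lit, -lit):
        reduced = simplify(clauses, choice)
        if reduced is not None and dpll(reduced):
            return True
    return False
```

For example, `dpll([[1, -2], [2, 3], [-3]])` returns `True`, while the four-clause formula `[[1, 2], [-1, 2], [1, -2], [-1, -2]]` is reported unsatisfiable. A production solver of the kind SWE-AGI requires would add watched literals, clause learning, and restarts on top of this skeleton.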
Problem

Research questions and friction points this paper is trying to address.

specification-driven software construction
autonomous agents
large language models
production-scale software
MoonBit
Innovation

Methods, ideas, or system contributions that make the work stand out.

specification-driven software construction
MoonBit
autonomous agents
long-horizon reasoning
benchmarking LLMs
Zhirui Zhang
International Digital Economy Academy
Hongbo Zhang
International Digital Economy Academy, The Hong Kong University of Science and Technology
Haoxiang Fei
International Digital Economy Academy
Zhiyuan Bao
International Digital Economy Academy
Yubin Chen
International Digital Economy Academy
Zhengyu Lei
International Digital Economy Academy
Ziyue Liu
Ph.D. Student at CS UCSB (Efficient LLM Pre-Training, Scientific Machine Learning)
Yixuan Sun
Fudan University
Mingkun Xiao
International Digital Economy Academy
Zihang Ye
International Digital Economy Academy
Yu Zhang
International Digital Economy Academy
Hongcheng Zhu
International Digital Economy Academy
Yuxiang Wen
International Digital Economy Academy
Heung-Yeung Shum
Microsoft (Computer Vision, Computer Graphics)