RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

📅 2026-04-24
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing code generation benchmarks diverge from real-world industrial development workflows, limiting their ability to accurately assess the automated coding capabilities of large language models. This work proposes RealBench, the first repository-level code generation benchmark that integrates natural language requirements with UML system design diagrams, closely mirroring industry practices based on structured specifications. By systematically comparing holistic versus modular code generation strategies, the study evaluates the performance of prominent large language models. Experimental results reveal a significant performance drop in repository-scale tasks: while models can recognize UML components, the generated code frequently contains syntactic and logical errors. Holistic generation proves effective for small repositories, whereas modular approaches yield better results for complex systems. This work pioneers the incorporation of UML into code generation evaluation, uncovering critical capability gaps in system-level code synthesis.

📝 Abstract
Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks, such as HumanEval and EvoCodeBench, have been created to evaluate LLMs by requiring them to generate code from natural language requirements. However, in enterprise applications and team development, developers typically write code based on structured designs or specifications rather than raw natural language descriptions. Because of this gap between existing benchmarks and real industry development practices, current benchmark scores may not accurately reflect how much code generation can help automate software development tasks. To address this gap, we propose RealBench, a repository-level code generation benchmark aligned with real-world industry software development practices. Each example includes both natural language requirements and UML diagrams as the system design, matching how developers typically receive specifications. Based on the constructed benchmark, we conduct a systematic evaluation of advanced LLMs' code generation capabilities when provided with structured system designs. The experimental results reveal key insights into current LLMs' capabilities for repo-level code generation aligned with real-world software development practices. First, LLMs perform much worse on repo-level code generation, and there are significant performance gaps among LLMs. Second, LLMs are good at finding and creating the modules defined in UML diagrams, but the quality of the generated modules is often poor due to syntax and logic errors. Third, generating the entire repository at once is the best strategy on smaller repositories, while generating module by module works better than other strategies on complex repositories.
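
The two generation strategies compared in the abstract can be illustrated with a minimal sketch. This is not the benchmark's actual data format or prompting code: the `UmlClass` and `Task` schemas, the prompt wording, and the function names are all hypothetical, and the sketch only builds prompts rather than calling a model. It assumes the modular strategy shows the full design in every prompt and asks for one file at a time.

```python
from dataclasses import dataclass, field


@dataclass
class UmlClass:
    """One class box from a UML class diagram (hypothetical schema)."""
    name: str
    module: str  # file the class is expected to live in
    methods: list[str] = field(default_factory=list)


@dataclass
class Task:
    """A benchmark-style example: requirements text plus a UML design."""
    requirements: str
    classes: list[UmlClass]


def render_design(classes: list[UmlClass]) -> str:
    """Flatten the UML classes into plain text suitable for a prompt."""
    return "\n".join(
        f"class {c.name} in {c.module}: {', '.join(c.methods)}" for c in classes
    )


def holistic_prompts(task: Task) -> list[str]:
    """Holistic strategy: a single prompt asking for the whole repository."""
    design = render_design(task.classes)
    return [f"{task.requirements}\n\nDesign:\n{design}\n\nWrite every file."]


def modular_prompts(task: Task) -> list[str]:
    """Modular strategy: one prompt per module, in first-seen order."""
    design = render_design(task.classes)
    modules = list(dict.fromkeys(c.module for c in task.classes))
    return [
        f"{task.requirements}\n\nDesign:\n{design}\n\nWrite only {m}."
        for m in modules
    ]
```

For a design with classes spread over two files, `holistic_prompts` yields one prompt and `modular_prompts` yields two. The trade-off the abstract reports maps onto this directly: a single prompt keeps everything in one context, which suits small repositories, while per-module prompts keep each request small as the design grows.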
Problem

Research questions and friction points this paper is trying to address.

code generation
software development practices
LLM evaluation
repository-level
structured specifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

RealBench
repository-level code generation
UML diagrams
structured system design
LLM evaluation
Jia Li
Wuhan University, China
Hongyi Deng
Peking University, China
Yiran Zhang
Nanyang Technological University
Software Architecture, Reverse Engineering, Program Comprehension
Kechi Zhang
Peking University
AI4SE
Tianqi Shao
Key Lab of High Confidence Software Technology (PKU), Ministry of Education; School of Computer Science, Peking University, China
Tiankuo Zhao
Wuhan University, China
Weinan Wang
Peking University, China
Zhi Jin
Sun Yat-Sen University, Associate Professor
Ge Li
Full Professor of Computer Science, Peking University
Program Analysis, Program Generation, Deep Learning
Yang Liu
Nanyang Technological University
Agent, Software Engineering, Cyber Security, Trustworthy AI, Software Security
Yingtao Fang
Wuhan University, China
Yihong Dong
Peking University
Code Generation, Large Language Models