🤖 AI Summary
This study addresses a gap in existing code-generation benchmarks, which rarely assess whether large language models can satisfy both functional correctness and non-functional quality attributes (such as maintainability and security) in application-level scenarios. To this end, the authors propose RAL-Bench, the first systematic evaluation framework grounded in the ISO/IEC 25010 standard. The framework distills natural-language requirements from high-quality open-source projects and designs black-box, end-to-end tests covering functional behavior and five key non-functional dimensions. Per-dimension metrics are aggregated using the Analytic Hierarchy Process (AHP) with weights derived from expert judgment, and all tests are validated against the reference implementations to ensure benchmark reliability. In zero-shot evaluations of 16 prominent large language models, functional correctness emerged as the primary bottleneck: no model achieved a functional pass rate above 45% under requirement-driven, reference-validated testing.
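As a concrete illustration of the AHP step, the sketch below derives a weight vector from a pairwise-comparison matrix via the principal eigenvector and checks Saaty's consistency ratio. The matrix entries and the ordering of the five dimensions are illustrative placeholders, not the paper's expert judgments.

```python
import numpy as np

# Hypothetical 5x5 pairwise-comparison matrix over the five
# non-functional dimensions (illustrative values only; the paper
# derives its matrix from expert judgment). A[i, j] > 1 means
# dimension i is judged more important than j, and A[j, i] = 1 / A[i, j].
A = np.array([
    [1,   3,   5,   3,   7],
    [1/3, 1,   3,   1,   5],
    [1/5, 1/3, 1,   1/3, 3],
    [1/3, 1,   3,   1,   5],
    [1/7, 1/5, 1/3, 1/5, 1],
])

n = A.shape[0]
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)

# AHP weights: the normalized principal eigenvector of A.
w = eigvecs[:, k].real
w = w / w.sum()

# Saaty's consistency check: CR < 0.1 means the pairwise judgments
# are acceptably consistent. RI(5) = 1.12 is the random index for n = 5.
CI = (eigvals[k].real - n) / (n - 1)
CR = CI / 1.12
print(f"weights = {np.round(w, 3)}, consistency ratio = {CR:.3f}")
```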
📝 Abstract
Code generation has advanced rapidly with code-focused large language models (LLMs), especially on snippet-level tasks. However, application-level generation requires producing a runnable multi-file repository with correct structure, dependencies, and end-to-end executability, and real-world software must satisfy both functional correctness and non-functional quality (e.g., maintainability, security). Existing benchmarks offer only limited execution-based assessment of these requirements at the application level. We ask: Can current LLMs generate application-level repositories that meet both functional and non-functional criteria? We propose RAL-Bench, a benchmark and evaluation framework for application-level code generation. For each task, we distill a concise natural-language requirement from a high-quality reference project, build black-box system tests covering functional and non-functional attributes, and keep only tests that pass on the reference repository to ensure a sound oracle and an end-to-end executable suite. Functional correctness is measured by system-test pass rate. Non-functional quality is measured along five ISO/IEC 25010-inspired dimensions and aggregated with an Analytic Hierarchy Process (AHP)-derived weight vector, with per-dimension diagnostics and baseline-normalized scoring using reference measurements. Across 16 LLMs evaluated zero-shot with greedy decoding, functional correctness is the dominant bottleneck: no model exceeds a 45% functional pass rate under our requirement-driven, reference-validated tests. We release RAL-Bench at https://github.com/Wwstarry/RAL-Bench.
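To make the scoring pipeline concrete, here is a minimal sketch of how the two signals described above could be combined: a system-test pass rate for functional correctness, and a baseline-normalized, AHP-weighted sum over the five non-functional dimensions. The function names, the normalization direction (higher is better, capped at the reference's level), and the example numbers are assumptions for illustration; the paper's exact metric definitions may differ.

```python
import numpy as np

def functional_score(passed: int, total: int) -> float:
    """Functional correctness: fraction of black-box system tests passed."""
    return passed / total

def nonfunctional_score(model_scores, reference_scores, weights) -> float:
    """Baseline-normalized, AHP-weighted non-functional quality.

    Each dimension is scored relative to the reference repository's
    measurement and capped at 1, so matching the reference is a perfect
    score on that dimension (an assumed convention, not the paper's
    exact formula).
    """
    normalized = np.minimum(
        np.asarray(model_scores) / np.asarray(reference_scores), 1.0
    )
    return float(np.dot(weights, normalized))

# Illustrative values: 18 of 45 system tests passed, plus made-up
# per-dimension measurements and an AHP weight vector summing to 1.
w = np.array([0.40, 0.20, 0.10, 0.20, 0.10])
print(functional_score(18, 45))  # 0.4
print(nonfunctional_score([0.8, 0.9, 0.5, 0.7, 1.2],
                          [1.0, 1.0, 1.0, 1.0, 1.0], w))  # 0.79
```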