BaxBench: Can LLMs Generate Correct and Secure Backends?

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of rigorous evaluation for large language models’ (LLMs) capability to generate production-grade backend applications. We introduce BaxBench—the first benchmark tailored to secure, multi-file, multi-function web/cloud backend development—comprising 392 tasks spanning mainstream (e.g., Express, FastAPI, Flask) and niche frameworks. Methodologically, we propose an end-to-end evaluation framework integrating functional correctness testing with automated vulnerability exploitation. Our key contributions are: (1) the first systematic assessment of LLMs’ ability to generate complete, functionally correct, and secure backend modules; (2) a novel joint evaluation paradigm combining functional correctness and runtime security exposure; and (3) empirical findings revealing that state-of-the-art models achieve only ~60% functional correctness, over 50% of ostensibly “correct” implementations contain exploitable vulnerabilities, and performance degrades markedly on niche frameworks. BaxBench serves as a critical diagnostic tool for advancing secure autonomous software development.

📝 Abstract
The automatic generation of programs has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve algorithmic coding tasks. However, to achieve full automation, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI o1, achieves a mere 60% on code correctness; (ii) on average, we could successfully execute security exploits on more than half of the correct programs generated by each LLM; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps towards autonomous and secure software development with LLMs.
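The joint evaluation paradigm described above can be illustrated with a minimal sketch. Everything here is hypothetical, not BaxBench's actual harness: `lookup_secret` stands in for an LLM-generated endpoint that passes its functional tests but concatenates user input into SQL (a classic injection flaw), and `evaluate` runs a correctness test followed by an end-to-end exploit attempt, passing a program only if it is both correct and secure.

```python
import sqlite3

def make_db():
    """Fresh in-memory database for each evaluation run."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("alice", "s1"), ("bob", "s2")])
    return conn

def lookup_secret(conn, name):
    """Hypothetical 'LLM-generated' handler: functionally correct,
    but builds the query by string interpolation (SQL injection)."""
    rows = conn.execute(
        f"SELECT secret FROM users WHERE name = '{name}'").fetchall()
    return [r[0] for r in rows]

def evaluate(handler):
    """Joint evaluation: functional test cases plus an exploit attempt."""
    conn = make_db()
    # Functional correctness: the intended query returns the right secret.
    functional = handler(conn, "alice") == ["s1"]
    # End-to-end exploit: an injection payload should NOT leak extra rows.
    exploited = len(handler(conn, "' OR '1'='1")) > 1
    return {"correct": functional, "secure": not exploited}

result = evaluate(lookup_secret)
# result → {"correct": True, "secure": False}: the program would count as
# "correct" under functionality-only benchmarks, yet fails the joint check.
```

This mirrors the paper's central finding in miniature: a program can satisfy every functional test case while still being trivially exploitable, which is why correctness and security must be measured together.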
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs on generating complete backend applications
Jointly assess code correctness and security exposure
Identify limitations of LLMs in secure software development
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for secure, multi-file backend generation by LLMs
Joint evaluation of functional correctness and end-to-end exploits
392 tasks spanning mainstream and niche backend frameworks