ProgramBench: Can Language Models Rebuild Programs From Scratch?

📅 2026-05-05

🤖 AI Summary

Existing benchmarks mostly assess language models on localized programming tasks, such as fixing a single bug or adding one specified feature, and so do not capture whether a model can build a complete software system from scratch. This work introduces ProgramBench, an end-to-end evaluation framework grounded in behavioral equivalence: given only a reference program and its documentation, an agent must design and implement a full codebase whose behavior matches that of the reference executable, and correctness is checked with behavioral test suites. The benchmark comprises 200 tasks, ranging from compact CLI tools to widely used software such as FFmpeg and SQLite. It places no constraints on implementation structure and uses agent-driven fuzzing to generate the behavioral test cases automatically. An evaluation of nine language models finds that none fully solves any task; the best-performing model passes 95% of tests on only 3% of tasks, and models tend to produce monolithic, single-file implementations that diverge structurally from human-written code.
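The headline numbers combine two levels of aggregation: a pass rate per task, and the share of tasks that reach a given pass rate. The sketch below illustrates that arithmetic on made-up data. It is not the benchmark's scoring code; the function names, the toy results, and the choice to treat "passes 95% of tests" as "at least 95%" are all assumptions made for illustration.

```python
# Illustrative only: names, data, and the ">=" threshold rule are assumptions.

def pass_rate(results: list[bool]) -> float:
    """Fraction of behavioral tests passed for one task (0.0 if no tests)."""
    return sum(results) / len(results) if results else 0.0


def share_of_tasks_at_threshold(
    per_task: dict[str, list[bool]], threshold: float = 0.95
) -> float:
    """Fraction of tasks whose pass rate is at least `threshold`."""
    if not per_task:
        return 0.0
    hits = sum(1 for r in per_task.values() if pass_rate(r) >= threshold)
    return hits / len(per_task)


def fully_resolved(per_task: dict[str, list[bool]]) -> list[str]:
    """Tasks with at least one test where every test passes."""
    return [name for name, r in per_task.items() if r and all(r)]


# Hypothetical per-task test outcomes for a single model.
toy_results = {
    "task_a": [True] * 20,                  # every test passes
    "task_b": [True] * 19 + [False],        # 19 of 20 tests pass
    "task_c": [True] * 10 + [False] * 10,   # half the tests pass
    "task_d": [],                           # no tests recorded
}

near_solved_share = share_of_tasks_at_threshold(toy_results, threshold=0.95)
resolved_tasks = fully_resolved(toy_results)
```

Separating the two functions makes the distinction in the summary explicit: a task can clear the 95% threshold without being fully resolved, which requires every test to pass.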

📝 Abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
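One way to picture a behavioral test that does not prescribe implementation structure is to run the reference executable and the rebuilt program on the same input and compare only what is externally observable. The sketch below is a minimal illustration of that idea, not ProgramBench's harness. The `Observation` type, the helper names, and the decision to compare exit code, stdout, and stderr byte for byte are assumptions; the two toy "programs" are hypothetical stand-ins launched through the Python interpreter.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible result of one program run (assumed comparison set)."""
    exit_code: int
    stdout: bytes
    stderr: bytes


def run(cmd: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command and record its exit code and output streams."""
    proc = subprocess.run(cmd, input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.returncode, proc.stdout, proc.stderr)


def behaviorally_equal(
    reference: list[str], candidate: list[str], args: list[str], stdin: bytes = b""
) -> bool:
    """True if both programs produce identical observations for one input."""
    return run(reference + args, stdin) == run(candidate + args, stdin)


# Hypothetical stand-ins: the "reference" upper-cases its argument, while the
# "candidate" swaps case, so they agree only on all-lowercase input.
REFERENCE = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
CANDIDATE = [sys.executable, "-c", "import sys; print(sys.argv[1].swapcase())"]

agrees_on_lowercase = behaviorally_equal(REFERENCE, CANDIDATE, ["abc"])
agrees_on_mixed_case = behaviorally_equal(REFERENCE, CANDIDATE, ["aBc"])
```

The toy pair shows why input diversity matters: a single lowercase input would not distinguish the two programs, whereas a mixed-case input exposes the difference. Generating many varied inputs, as the abstract describes for agent-driven fuzzing, serves that purpose.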

Problem

Research questions and friction points this paper is trying to address.

software engineering agents

program synthesis

language models

code generation

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

ProgramBench

software engineering agents

agent-driven fuzzing