A benchmark for vericoding: formally verified program synthesis

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation benchmarks for formal-verification-guided program synthesis (vericoding). The authors introduce the largest vericoding benchmark to date, comprising 12,504 formal specifications across three verification languages (Dafny, Verus/Rust, and Lean), including 6,174 new, unseen problems released open-source. Using off-the-shelf LLMs, vericoding success rates are 82% in Dafny, 44% in Verus/Rust, and 27% in Lean, and adding natural-language descriptions to the formal specifications does not significantly improve these rates. The authors also report that LLM progress has raised pure Dafny verification success from 68% to 96% (a 28-percentage-point gain) within one year. The benchmark enables reproducible, rigorous evaluation of large language models for reliable, formally verified program synthesis.

📝 Abstract
We present and test the largest benchmark for vericoding: LLM generation of formally verified code from formal specifications, in contrast to vibe coding, which generates potentially buggy code from a natural-language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust, and 7,141 in Lean. Of these, 6,174 are new, unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust, and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has raised pure Dafny verification success from 68% to 96% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark.
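
To make the task format concrete: a vericoding problem fixes a formal specification, and the model must supply both an implementation and a proof that the verifier accepts, with no natural-language problem statement required. The following is a hypothetical toy instance in Lean, written for illustration only (it is not taken from the benchmark; `max2` and `max2_spec` are invented names):

```lean
-- Hypothetical toy problem in the benchmark's spirit (not from the dataset).
-- The specification is the theorem statement below; a vericoder must supply
-- the body of `max2` and a proof that the Lean checker accepts.
def max2 (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification: the result is an upper bound of both inputs.
theorem max2_spec (a b : Nat) : a ≤ max2 a b ∧ b ≤ max2 a b := by
  unfold max2
  by_cases h : a ≤ b <;> simp [h] <;> omega
```

Success is binary: the code either verifies against the specification or it does not, which is what makes the benchmark's pass rates well-defined across Dafny, Verus/Rust, and Lean.
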

Problem

Research questions and friction points this paper is trying to address.

No large-scale benchmark exists for formally verified code synthesis (vericoding)
How well can off-the-shelf LLMs generate verified code from formal specifications alone?
Does vericoding success vary across verification languages?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest vericoding benchmark to date, with 12,504 formal specifications (6,174 new, unseen problems)
Evaluates LLM generation of formally verified code in three languages: Dafny, Verus/Rust, and Lean
Shows that adding natural-language descriptions does not significantly improve verification success
Authors

Sergiu Bursuc, Beneficial AI Foundation
Theodore Ehrenborg, Beneficial AI Foundation
Shaowei Lin, Beneficial AI Foundation
Lacramioara Astefanoaei, Beneficial AI Foundation
Ionel Emilian Chiosa, Massachusetts Institute of Technology
Jure Kukovec, Beneficial AI Foundation
Alok Singh, Beneficial AI Foundation
Oliver Butterley, Beneficial AI Foundation
Adem Bizid, Massachusetts Institute of Technology
Quinn Dougherty, Beneficial AI Foundation
Miranda Zhao, Massachusetts Institute of Technology
Max Tan, Massachusetts Institute of Technology
Max Tegmark, Professor of Physics, MIT