🤖 AI Summary
This work investigates the capability of large language models (LLMs) in verifiable code generation, i.e., jointly generating correct executable code, formal specifications, and machine-checkable proofs of code-specification alignment to ensure end-to-end correctness. To this end, the authors introduce Verina, an end-to-end benchmark for this task: built atop the Lean theorem prover, it comprises 189 manually curated problems, each accompanied by a detailed natural-language problem description, a reference implementation, a formal specification, and an extensive test suite. The benchmark defines a modular, reproducible evaluation protocol covering code, specification, and proof generation as well as their compositions, with generated specifications judged for both soundness and completeness. Experimental results reveal fundamental limitations: even the strongest model evaluated, OpenAI o4-mini, achieves (with one trial per task) only 61.4% correct code generation, 51.0% sound and complete specifications, and a mere 3.6% successful proofs, highlighting a critical gap in LLMs' ability to perform rigorous formal verification.
📝 Abstract
Large language models (LLMs) are increasingly integrated into software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unlock LLMs' benefits in coding. Yet there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling comprehensive and modular evaluation of code, specification, and proof generation, as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset at https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code at https://github.com/sunblaze-ucb/verina.
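To make the code/specification/proof triple concrete, here is a minimal hypothetical sketch in Lean 4 of the kind of task such a benchmark targets. The names (`myMax`, `myMax_spec`, `myMax_correct`) and the problem itself are illustrative assumptions, not an actual Verina dataset entry:

```lean
-- Implementation: return the larger of two natural numbers.
def myMax (a b : Nat) : Nat :=
  if a ≥ b then a else b

-- Formal specification: the result is an upper bound of both
-- inputs and is equal to one of them.
def myMax_spec (a b r : Nat) : Prop :=
  r ≥ a ∧ r ≥ b ∧ (r = a ∨ r = b)

-- Machine-checkable proof that the implementation satisfies
-- the specification: split on the `if`, then close each
-- linear-arithmetic goal with `omega`.
theorem myMax_correct (a b : Nat) : myMax_spec a b (myMax a b) := by
  unfold myMax myMax_spec
  split <;> omega
```

In an end-to-end setting, a model would be asked to produce all three components from the natural-language description alone, and the proof would be checked by the Lean kernel rather than by human review.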