🤖 AI Summary
Existing code generation benchmarks are nearing saturation and fail to adequately assess large language models' (LLMs) deep reasoning capabilities—particularly in formal verification. Method: We introduce VerifyThisBench, the first end-to-end program verification benchmark requiring joint generation of executable code, formal specifications, and machine-checkable proofs from natural-language problem descriptions. To isolate verification competence, we further propose VerifyThisBenchXS, a relaxed variant that supplies partial implementations or proofs. Our evaluation framework checks each output in stages: compilation, logical verification, and consistency between the specification and the problem statement. Contribution/Results: Experiments reveal that state-of-the-art LLMs achieve a pass rate below 4% on the full benchmark, and over half of the generated outputs fail to compile. These results expose fundamental limitations in current LLMs' ability to ensure program correctness, underscoring the need for verification-aware modeling and evaluation paradigms.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable progress in code generation, but many existing benchmarks are approaching saturation and offer little guarantee of the trustworthiness of generated programs, providing limited insight into deeper reasoning capabilities. We introduce VerifyThisBench, a new benchmark designed to evaluate LLMs on end-to-end program verification tasks that require interpreting natural language problem descriptions, formulating formal specifications, generating code, and constructing correctness proofs. Our evaluation reveals that even state-of-the-art (SOTA) models, such as o3-mini, achieve a pass rate of less than 4%, with many outputs failing to compile. To reduce task complexity, we further propose VerifyThisBenchXS, a variant in which partial implementations or proofs are provided. We systematically assess SOTA models on both benchmarks, uncovering key strengths and limitations in their formal reasoning and verification capabilities.