🤖 AI Summary
Existing code generation benchmarks are nearing saturation and fail to adequately assess large language models' (LLMs) deep reasoning capabilities—particularly in formal verification. Method: We introduce VerifyThisBench, the first end-to-end program verification benchmark requiring joint generation of executable code, formal specifications, and machine-checkable proofs from natural-language problem descriptions. To isolate verification competence, we further propose VerifyThisBenchXS, a relaxed variant that supplies partial implementations or proofs. Our evaluation framework checks each output in stages: compilation, logical verification, and consistency between the specification and the problem statement. Contribution/Results: Experiments reveal that state-of-the-art LLMs achieve a pass rate below 4% on the full benchmark, and over half of the generated outputs fail to compile. These results expose fundamental limitations in current LLMs' ability to ensure program correctness, underscoring the need for verification-aware modeling and evaluation paradigms.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable progress in code generation, but many existing benchmarks are approaching saturation and offer little guarantee of the trustworthiness of generated programs, providing limited insight into deeper reasoning capabilities. We introduce VerifyThisBench, a new benchmark designed to evaluate LLMs on end-to-end program verification tasks that require interpreting natural language problem descriptions, formulating formal specifications, generating code, and constructing correctness proofs. Our evaluation reveals that even state-of-the-art (SOTA) models, such as o3-mini, achieve a pass rate of less than 4%, with many outputs failing to compile. To reduce task complexity, we further propose VerifyThisBenchXS, a variant in which partial implementations or proofs are provided. We systematically assess SOTA models on both benchmarks, uncovering key strengths and limitations in their formal reasoning and verification capabilities.