VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

📅 2025-10-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Formal verification of code generated by large language models (LLMs) is severely hindered by the absence of ground-truth formal specifications, resulting in small-scale benchmarks (hundreds of trivial problems), high construction costs, and low reliability. Method: We propose VeriEquivBench—the first large-scale, formal-verification benchmark comprising 2,389 complex algorithmic problems—and introduce Equivalence Score, a novel, ground-truth-free metric for assessing logical equivalence between generated code and formal specifications, computed automatically with the Dafny verification toolchain. Contribution/Results: Our approach eliminates reliance on manual annotation, enabling scalable, reproducible evaluation. Experiments reveal substantial limitations in current state-of-the-art LLMs’ ability to generate formally verifiable code. VeriEquivBench effectively exposes critical model deficiencies and establishes a new paradigm and rigorous evaluation foundation for developing reliable programming agents.

📝 Abstract
Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs). While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove alignment with user intent, progress is bottlenecked by specification quality evaluation. Current benchmarks rely on matching against ground-truth specifications, a manual and expertise-intensive process that has limited existing datasets to a few hundred simple problems and also suffers from a reliability issue. To address this, we introduce VeriEquivBench, a new benchmark with $2,389$ complex algorithmic problems that probe the limitations of current models in both code generation and formal reasoning. Our evaluation framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code. Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs. This underscores both the difficulty of the task and the need for benchmarks like VeriEquivBench to drive progress toward scalable and reliable coding agents.
Problem

Research questions and friction points this paper is trying to address.

Evaluating formal specification quality without ground-truth references
Assessing equivalence of formally verifiable code and specifications
Benchmarking LLMs on complex algorithmic code verification tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces VeriEquivBench benchmark with 2,389 complex problems
Uses equivalence score instead of ground-truth specifications
Rigorously verifies generated specifications and code quality
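To make the equivalence-score idea concrete, here is a hypothetical sketch (not the paper's actual implementation) of how logical equivalence between two candidate Dafny specifications could be delegated to the verifier: emit a lemma asserting mutual implication, and accept if `dafny verify` discharges it. The function name, single-parameter signature, and predicate names below are illustrative assumptions.

```python
def equivalence_harness(spec_a: str, spec_b: str) -> str:
    """Build a Dafny file whose successful verification implies that
    spec_a and spec_b are logically equivalent for all integer inputs x
    (single-parameter case, for illustration only)."""
    return (
        f"predicate SpecA(x: int) {{ {spec_a} }}\n"
        f"predicate SpecB(x: int) {{ {spec_b} }}\n"
        "lemma Equivalent(x: int)\n"
        "  ensures SpecA(x) <==> SpecB(x)\n"
        "{}\n"  # empty body: Dafny must prove the biconditional itself
    )

# Two syntactically different but logically identical specifications:
harness = equivalence_harness("x >= 0", "0 <= x")
print(harness)
```

Passing the emitted file to the Dafny verifier would confirm (or refute) that the two phrasings agree; scaling this to full method contracts over arbitrary signatures is where the benchmark's automation effort lies.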
👥 Authors
Lingfei Zeng, Huazhong University of Science and Technology
Fengdi Che, University of Alberta
Xuhan Huang, The Chinese University of Hong Kong, Shenzhen
Fei Ye, Jilin University
Xu Xu, Hong Kong University of Science and Technology
Binhang Yuan, Hong Kong University of Science and Technology
Jie Fu, Shanghai Artificial Intelligence Laboratory