🤖 AI Summary
This work addresses the lack of cross-paradigm evaluation capabilities in existing AI code generation benchmarks, which struggle to assess model performance in vericoding, i.e., generating formally verified code from formal specifications. The authors introduce AlgoVeri, a novel benchmark comprising 77 classical algorithms, uniformly aligned via functional specifications across three verification languages: Dafny, Verus, and Lean. AlgoVeri is the first systematic evaluation framework supporting multi-language formal verification code generation. Empirical results reveal a pronounced impact of verification language design on model performance: Gemini-3 Flash achieves a 40.3% success rate in Dafny, but this drops to 24.7% in Verus and further to 7.8% in Lean. Additionally, iterative repair proves highly effective in Dafny, while models exhibit markedly distinct debugging behaviors across the three languages.
📝 Abstract
Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny (40.3% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus (24.7%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
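To make the "identical functional contracts" idea concrete, here is a minimal Lean 4 sketch of what a vericoding task looks like in the proof-construction setting; the function name `myMax` and this particular specification are hypothetical illustrations, not drawn from AlgoVeri. The model must produce both an implementation and a machine-checked proof that the implementation satisfies the contract.

```lean
-- Hypothetical contract in the spirit of AlgoVeri: an implementation
-- plus a machine-checked proof that it meets the specification.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification: the result is an upper bound of both inputs
-- and is equal to one of them.
theorem myMax_spec (a b : Nat) :
    a ≤ myMax a b ∧ b ≤ myMax a b ∧ (myMax a b = a ∨ myMax a b = b) := by
  unfold myMax
  split <;> omega   -- case-split on the `if`, then linear arithmetic
```

In Dafny or Verus the same contract would instead be written as `requires`/`ensures` clauses and discharged largely by the SMT solver, which is the automation gap that the paper's Dafny-versus-Lean success rates reflect.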