🤖 AI Summary
This work addresses the lack of cross-paradigm evaluation capabilities in existing AI code generation benchmarks, which struggle to assess model performance in vericoding, i.e., generating formally verified code from formal specifications. The authors introduce AlgoVeri, a novel benchmark comprising 77 classical algorithms, uniformly aligned via functional specifications across three verification languages: Dafny, Verus, and Lean. AlgoVeri is the first systematic evaluation framework supporting multi-language formal verification code generation. Empirical results reveal a pronounced impact of verification language design on model performance: Gemini-3 Flash achieves a 40.3% success rate in Dafny, but this drops to 24.7% in Verus and further to 7.8% in Lean. Additionally, iterative repair proves highly effective in Dafny, while models exhibit markedly distinct debugging behaviors across the three languages.
📝 Abstract
Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny (40.3% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus (24.7%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
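To make the "identical functional contracts" idea concrete, here is a minimal Lean 4 sketch of what a vericoding task looks like in the proof-construction setting; the function name `myMax` and this particular specification are hypothetical illustrations, not drawn from AlgoVeri. The model must produce both an implementation and a machine-checked proof that the implementation satisfies the contract.

```lean
-- Hypothetical contract in the spirit of AlgoVeri: an implementation
-- plus a machine-checked proof that it meets the specification.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification: the result is an upper bound of both inputs
-- and is equal to one of them.
theorem myMax_spec (a b : Nat) :
    a ≤ myMax a b ∧ b ≤ myMax a b ∧ (myMax a b = a ∨ myMax a b = b) := by
  unfold myMax
  split <;> omega   -- case-split on the `if`, then linear arithmetic
```

In Dafny or Verus the same contract would instead be written as `requires`/`ensures` clauses and discharged largely by the SMT solver, which is the automation gap that the paper's Dafny-versus-Lean success rates reflect.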