Agentic Proving for Program Verification

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the mismatch between the difficulty of current program verification benchmarks and the capabilities of modern agentic provers. It evaluates Claude Code, an agentic coding tool, in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation, covering specification generation, certification of implementations against specifications, and feedback on its own attempts. The evaluation reveals limitations of isomorphism-based specification scoring and points to the need for more rigorous, bug-resilient evaluation methodologies. Claude generates arguably valid specifications for 98.8% of problems, certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% end-to-end success rate on entries with self-consistent premises, while identifying underlying causes of failure and lingering bugs in the dataset.

📝 Abstract

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.
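
To make "isomorphism-based scoring" concrete, here is a minimal Lean 4 sketch. It assumes that such scoring accepts a generated specification only when a machine-checked equivalence with the ground-truth specification can be proved. The names `groundTruthSpec`, `generatedSpec`, and `spec_iso` are hypothetical toy definitions for illustration and are not taken from CLEVER.

```lean
-- Hypothetical ground-truth specification: the result is the input doubled.
def groundTruthSpec (n result : Nat) : Prop := result = n + n

-- Hypothetical generated specification: same meaning, different syntactic form.
def generatedSpec (n result : Nat) : Prop := result = 2 * n

-- The "isomorphism" obligation: both specifications accept exactly the same
-- input/output pairs. Each direction reduces to linear arithmetic over Nat.
theorem spec_iso (n result : Nat) :
    generatedSpec n result ↔ groundTruthSpec n result := by
  unfold generatedSpec groundTruthSpec
  constructor <;> intro h <;> omega
```

Under this assumption, a generated specification that is arguably valid but not provably equivalent to the reference (or whose equivalence proof is simply not found) would be rejected, which is one way to read the gap between the 98.8% and 81.3% figures above.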

Problem

Research questions and friction points this paper is trying to address.

program verification

agentic proving

benchmark evaluation

specification scoring

isomorphism-based scoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic proving

program verification

compiler-in-the-loop