🤖 AI Summary
This study addresses the mismatch between the difficulty of current program verification benchmarks and the capabilities of modern agentic provers. It evaluates Claude Code in an agentic proving framework built around the Lean 4 formal environment, covering automated specification generation, certification of implementations, and self-feedback on its own attempts on the CLEVER benchmark. The evaluation exposes limitations of isomorphism-based specification scoring and motivates more rigorous, bug-resilient evaluation methodologies. Claude generates arguably valid specifications for 98.8% of problems, certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% end-to-end success rate on entries with self-consistent premises, while identifying the causes of failures and lingering bugs in the dataset.
📝 Abstract
Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.
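To make the terminology concrete, the following is a minimal Lean 4 sketch of the two kinds of proof obligation the abstract refers to: certifying an implementation against a ground-truth specification, and showing that a generated specification is equivalent ("isomorphic") to the ground-truth one. The task and every name in it (`specGT`, `specGen`, `impl`, `impl_correct`, `spec_iso`) are hypothetical illustrations, not taken from CLEVER, and the sketch assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- Hypothetical toy task (not a CLEVER entry): double a natural number.

-- Stand-in for a ground-truth specification relating input `n` to result `r`.
def specGT (n r : Nat) : Prop := r = n + n

-- Stand-in for a generated specification, phrased differently.
def specGen (n r : Nat) : Prop := r = 2 * n

-- Stand-in for a generated implementation.
def impl (n : Nat) : Nat := n + n

-- Certification: the implementation satisfies the ground-truth specification.
theorem impl_correct (n : Nat) : specGT n (impl n) := by
  simp [specGT, impl]

-- Isomorphism-style check: the generated specification is logically
-- equivalent to the ground-truth one for every input and result.
theorem spec_iso (n r : Nat) : specGen n r ↔ specGT n r := by
  unfold specGen specGT
  constructor <;> intro h <;> omega
```

In this toy case the equivalence is a one-line arithmetic fact. A scoring scheme that requires such an equivalence proof can, in principle, reject a specification that is valid but phrased so that the equivalence is hard to establish, which is one plausible reading of the gap between the 98.8% and 81.3% figures above.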