InvBench: Can LLMs Accelerate Program Verification with Invariant Synthesis?

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatically synthesizing strong loop invariants remains a fundamental challenge in program verification. This paper introduces the first formal, correctness-guaranteed evaluation framework for LLM-generated invariants, designed to assess both functional correctness and verification acceleration. Methodologically, we propose a verifier-based decision procedure with a formal soundness guarantee, systematically evaluate seven state-of-the-art LLMs against the traditional verifier UAutomizer, and improve performance via supervised fine-tuning (SFT) and Best-of-N sampling. Key contributions: (1) the first benchmark for LLM-based invariant synthesis targeting verification speedup; (2) empirical evidence that model capability critically impacts verification efficiency; and (3) quantitative improvements: SFT raises the share of acceleration cases for Qwen3-Coder-480B from 8% to 29.2%, while Best-of-N raises it for Claude-sonnet-4 from 8.8% to 22.1%. Overall, results indicate that current LLMs still fall short of consistently outperforming classical invariant generators.
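For readers unfamiliar with the term: a loop invariant is a property that holds before every iteration of a loop, and together with the loop's exit condition it implies the postcondition. The toy Python function below (an illustration, not drawn from the paper's benchmark) makes this concrete by asserting the invariant at runtime:

```python
def sum_upto(n):
    """Sum 0 + 1 + ... + (n - 1), checking a loop invariant each iteration."""
    s, i = 0, 0
    while i < n:
        # Loop invariant: s is the sum of 0..i-1, i.e. s == i*(i-1)//2.
        assert s == i * (i - 1) // 2
        s += i
        i += 1
    # Invariant plus exit condition (i == n) gives the postcondition.
    assert s == n * (n - 1) // 2
    return s
```

A verifier like UAutomizer must discover such an invariant (or be handed one) to prove the final assertion; synthesizing it is the hard step this paper asks LLMs to take over.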

📝 Abstract
Program verification relies on loop invariants, yet automatically discovering strong invariants remains a long-standing challenge. We introduce a principled framework for evaluating LLMs on invariant synthesis. Our approach uses a verifier-based decision procedure with a formal soundness guarantee and assesses not only correctness but also the speedup that invariants provide in verification. We evaluate 7 state-of-the-art LLMs and existing LLM-based verifiers against the traditional solver UAutomizer. While LLM-based verifiers represent a promising direction, they do not yet offer a significant advantage over UAutomizer. Model capability also proves critical, as shown by sharp differences in speedups across models, and our benchmark remains an open challenge for current LLMs. Finally, we show that supervised fine-tuning and Best-of-N sampling can improve performance: fine-tuning on 3589 instances raises the percentage of speedup cases for Qwen3-Coder-480B from 8% to 29.2%, and Best-of-N sampling with N=16 improves Claude-sonnet-4 from 8.8% to 22.1%.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to synthesize loop invariants for program verification
Assessing whether LLM-generated invariants accelerate verification processes
Benchmarking LLM performance against traditional verification tools like UAutomizer
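The acceleration question above can be made concrete with a small timing harness. The routines in the sketch below are illustrative stand-ins, not the paper's actual UAutomizer pipeline: in a real evaluation, `baseline` would be the verifier run on the bare program and `accelerated` the verifier run on the program annotated with an LLM-supplied invariant.

```python
import time

def speedup_ratio(baseline, accelerated, *args):
    """Wall-clock ratio baseline_time / accelerated_time.

    A ratio > 1 means the accelerated routine (e.g. verification with a
    supplied invariant) finished faster than the baseline.
    """
    t0 = time.perf_counter()
    baseline(*args)
    t1 = time.perf_counter()
    accelerated(*args)
    t2 = time.perf_counter()
    # Floor the denominator to avoid division by zero on very fast calls.
    return (t1 - t0) / max(t2 - t1, 1e-9)
```

Counting, across a benchmark, the fraction of programs where this ratio exceeds 1 yields a "percentage of speedup cases" in the spirit of the numbers reported above.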
Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifier-based decision procedure with formal soundness guarantee
Evaluates invariant correctness and verification speedup
Uses supervised fine-tuning and Best-of-N sampling
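Best-of-N sampling combined with a verifier-based accept check can be sketched as follows. The candidate pool and the accept check are toy stand-ins: a real pipeline would draw samples from an LLM and discharge each candidate with a sound verifier such as UAutomizer.

```python
def verifier_accepts(program, invariant):
    # Toy stand-in for a sound verifier check; a real pipeline would invoke
    # UAutomizer on the program annotated with the candidate invariant.
    known_good = {"s == i * (i - 1) // 2"}
    return invariant in known_good

def sample_invariant(program, k):
    # Toy stand-in for the k-th LLM sample; cycles through fixed candidates.
    candidates = ["s >= 0", "i <= n", "s == i * (i - 1) // 2"]
    return candidates[k % len(candidates)]

def best_of_n(program, n=16):
    """Return the first verifier-accepted candidate among n samples, or None.

    Soundness rests entirely on the verifier check, not on the sampler, which
    is why unreliable LLM proposals can still yield a trustworthy result.
    """
    for k in range(n):
        inv = sample_invariant(program, k)
        if verifier_accepts(program, inv):
            return inv
    return None
```

The default `n=16` mirrors the N=16 setting reported for Claude-sonnet-4 in the abstract; the function names here are hypothetical.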