🤖 AI Summary
Automatically synthesizing strong loop invariants remains a fundamental challenge in program verification. This paper introduces the first formal, correctness-guaranteed evaluation framework for LLM-generated invariants, explicitly designed to assess both functional correctness and verification acceleration. Methodologically, we propose a verifier-driven decision procedure, integrate supervised fine-tuning (SFT) and Best-of-N sampling, and systematically evaluate seven state-of-the-art LLMs against the traditional verifier UAutomizer. Our key contributions are: (1) the first benchmark for LLM-based invariant synthesis targeting verification speedup; (2) empirical evidence that model capability critically impacts verification efficiency; and (3) quantitative improvements: SFT raises the proportion of acceleration cases for Qwen3-Coder-480B from 8% to 29.2%, while Best-of-N sampling raises it for Claude-sonnet-4 from 8.8% to 22.1%. Overall, the results indicate that current LLMs still fall short of consistently outperforming classical invariant generators.
📝 Abstract
Program verification relies on loop invariants, yet automatically discovering strong invariants remains a long-standing challenge. We introduce a principled framework for evaluating LLMs on invariant synthesis. Our approach uses a verifier-based decision procedure with a formal soundness guarantee and assesses not only correctness but also the speedup that invariants provide in verification. We evaluate seven state-of-the-art LLMs and existing LLM-based verifiers against the traditional verifier UAutomizer. While LLM-based verifiers represent a promising direction, they do not yet offer a significant advantage over UAutomizer. Model capability also proves critical, as shown by sharp differences in speedups across models, and our benchmark remains an open challenge for current LLMs. Finally, we show that supervised fine-tuning and Best-of-N sampling can improve performance: fine-tuning on 3,589 instances raises the percentage of speedup cases for Qwen3-Coder-480B from 8% to 29.2%, and Best-of-N sampling with N=16 improves Claude-sonnet-4 from 8.8% to 22.1%.
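The verifier-based Best-of-N selection described above can be sketched as a simple filter loop: sample N candidate invariants from a model, then return the first one the verifier accepts. This is a minimal illustrative sketch, not the paper's implementation; the candidate strings and the `verify` predicate below are toy placeholders standing in for LLM sampling and a real verifier such as UAutomizer.

```python
def best_of_n(candidates, verify):
    """Return the first verifier-accepted candidate invariant, else None.

    `candidates` stands in for N samples drawn from an LLM;
    `verify` stands in for a sound verifier's accept/reject check,
    so any returned invariant is correct by construction.
    """
    for inv in candidates:
        if verify(inv):
            return inv
    return None


# Toy usage: three sampled candidate invariants for a counting loop,
# where only "i <= n" is actually inductive.
samples = ["x >= 0", "x == y", "i <= n"]
chosen = best_of_n(samples, lambda inv: inv == "i <= n")
print(chosen)  # prints "i <= n"
```

Because acceptance is delegated to the verifier, soundness of the overall procedure does not depend on the model: a wrong candidate is simply rejected, and increasing N only raises the chance of finding an accepted invariant.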