Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the critical bottleneck in automated program verification—synthesizing inductive loop invariants—where existing large language models often produce invalid or inefficient candidates on challenging instances. We introduce Wonda, a novel data curation pipeline that formally defines, for the first time, rigorous properties of high-quality invariants and constructs a refined training set through AST normalization and LLM-driven semantic rewriting. A small language model (4B parameters) fine-tuned on this curated data achieves, without any inference-time overhead, a twofold improvement in both correctness and speedup on hard InvBench instances, matching the performance of GPT-OSS-120B and approaching that of GPT-5.2, while boosting the virtual best solver’s verification performance by up to 14.2%.

📝 Abstract
The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. While fine-tuning is a natural route to mitigate this limitation, obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required for a high-quality training invariant. Second, we propose Wonda, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset results in consistent and significant performance gains. In particular, a fine-tuned 4B parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness and speedup rates of base models and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.
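To make the AST-based normalization step concrete, the sketch below shows one plausible form such a pass could take (this is an illustrative assumption, not the paper's actual implementation, and the `normalize` helper is hypothetical): parsing each candidate invariant expression and canonicalizing commutative structure, so that syntactic variants of the same invariant collapse to a single training example.

```python
# Illustrative sketch only: a minimal AST-based normalizer for boolean
# invariant expressions, assuming invariants are written in Python-like
# syntax. Real verifier output would need a language-specific parser.
import ast

def normalize(expr: str) -> str:
    """Return a canonical string form of a boolean invariant expression."""
    tree = ast.parse(expr, mode="eval")

    class Canonicalizer(ast.NodeTransformer):
        def visit_BoolOp(self, node):
            self.generic_visit(node)
            # Sort conjuncts/disjuncts so `a and b` equals `b and a`.
            node.values.sort(key=ast.dump)
            return node

        def visit_BinOp(self, node):
            self.generic_visit(node)
            # Order operands of commutative operators (+, *) canonically.
            if isinstance(node.op, (ast.Add, ast.Mult)):
                if ast.dump(node.left) > ast.dump(node.right):
                    node.left, node.right = node.right, node.left
            return node

    tree = Canonicalizer().visit(tree)
    return ast.unparse(tree)

# Two syntactic variants of the same invariant normalize identically,
# so only one copy would survive into the curated training set:
assert normalize("y + x <= n and i >= 0") == normalize("i >= 0 and x + y <= n")
```

Deduplicating at the AST level rather than by raw string matching is one way a pipeline like Wonda could avoid overweighting trivially rephrased invariants before the LLM-driven semantic rewriting stage.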
Problem

Research questions and friction points this paper is trying to address.

inductive loop invariants
program verification
training data curation
automated reasoning
invariant synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

loop invariants
data curation
small language models
program verification
semantic rewriting