🤖 AI Summary
Automated generation of loop invariants remains a critical bottleneck in program verification. This paper introduces NeuroInv, a neuro-symbolic framework that integrates Hoare logic’s weakest precondition (WP) reasoning into large language model (LLM) inference. The method combines LLM-driven backward WP derivation with iterative, counterexample-based invariant repair guided by OpenJML, closing the loop between LLM-proposed invariants and formal verification. Evaluated on 150 Java benchmarks, the approach achieves a 99.5% success rate, substantially outperforming the other evaluated approaches. On a harder suite of 10 larger multi-loop programs, averaging seven loops each, its results indicate that the approach scales to more complex verification scenarios.
📝 Abstract
Loop invariant generation remains a critical bottleneck in automated program verification. Recent work has begun to explore the use of Large Language Models (LLMs) in this area, yet these approaches tend to lack a reliable and structured methodology, with little reference to existing program verification theory. This paper presents NeuroInv, a neurosymbolic approach to loop invariant generation. NeuroInv comprises two key modules: (1) a neural reasoning module that leverages LLMs and Hoare logic to derive and refine candidate invariants via backward-chaining weakest precondition reasoning, and (2) a verification-guided symbolic module that iteratively repairs invariants using counterexamples from OpenJML. We evaluate NeuroInv on a comprehensive benchmark of 150 Java programs, encompassing single and multiple (sequential) loops, multiple arrays, random branching, and noisy code segments. NeuroInv achieves a $99.5\%$ success rate, substantially outperforming the other evaluated approaches. Additionally, we introduce a hard benchmark of $10$ larger multi-loop programs (with an average of $7$ loops each); NeuroInv's performance in this setting demonstrates that it can scale to more complex verification scenarios.
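To make the propose-check-repair cycle concrete, the following is a minimal Java sketch and not NeuroInv's implementation. Everything in it is an assumption made for illustration: the class `InvariantRepairSketch`, the `Invariant` interface, the bounded brute-force `check` method (a toy stand-in for OpenJML), and the fixed candidate list in `demoCandidates` (a stand-in for successive LLM proposals). It checks the three standard Hoare-logic obligations for a loop invariant (initiation, preservation, and sufficiency for the postcondition) on a small summation loop and reports a counterexample state when one fails.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the loop below (hypothetical example, not from the paper):
//
//   int i = 0, s = 0;
//   //@ loop_invariant i <= n && s == i * (i + 1) / 2;
//   while (i < n) { i++; s += i; }
//   // postcondition: s == n * (n + 1) / 2   (assuming n >= 0)
class InvariantRepairSketch {
    /** A candidate invariant: a predicate over the loop state (i, s, n). */
    interface Invariant { boolean holds(int i, int s, int n); }

    // Small search bound; the checker is a bounded test, not a proof.
    static final int BOUND = 6;

    /** Returns a counterexample state {i, s, n}, or null if none is found within BOUND. */
    static int[] check(Invariant inv) {
        int maxS = BOUND * (BOUND + 1) / 2;
        for (int n = 0; n <= BOUND; n++) {
            // Initiation: the invariant must hold on loop entry (i = 0, s = 0).
            if (!inv.holds(0, 0, n)) return new int[] {0, 0, n};
            for (int i = 0; i <= BOUND; i++) {
                for (int s = 0; s <= maxS; s++) {
                    if (!inv.holds(i, s, n)) continue; // only states admitted by the invariant
                    if (i < n) {
                        // Preservation: one loop iteration must re-establish the invariant.
                        int i2 = i + 1, s2 = s + i2;
                        if (!inv.holds(i2, s2, n)) return new int[] {i, s, n};
                    } else if (s != n * (n + 1) / 2) {
                        // Sufficiency: invariant plus negated guard must imply the postcondition.
                        return new int[] {i, s, n};
                    }
                }
            }
        }
        return null;
    }

    /** Tries candidates in order; returns the index of the first one that passes, or -1. */
    static int repair(List<Invariant> candidates) {
        for (int k = 0; k < candidates.size(); k++) {
            int[] cex = check(candidates.get(k));
            if (cex == null) return k;
            // A real system would feed the counterexample back to the LLM at this point.
            System.out.println("candidate " + k + " rejected at " + Arrays.toString(cex));
        }
        return -1;
    }

    /** Hand-written candidates standing in for successive LLM proposals. */
    static List<Invariant> demoCandidates() {
        return Arrays.<Invariant>asList(
            (i, s, n) -> i <= n,                          // too weak: says nothing about s
            (i, s, n) -> s == i * (i + 1) / 2,            // too weak: allows i > n at loop exit
            (i, s, n) -> i <= n && s == i * (i + 1) / 2   // strong enough for the postcondition
        );
    }

    public static void main(String[] args) {
        System.out.println("accepted candidate index: " + repair(demoCandidates()));
    }
}
```

With these three candidates, `repair` rejects the first two and returns index 2. The first two candidates are both inductive but fail the sufficiency check, which is the kind of gap that backward reasoning from the postcondition is meant to close. A bounded search like this can only refute a candidate; accepting one soundly requires a deductive verifier such as OpenJML.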