Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency and poor scalability of manually authored proof scripts in system-level formal verification by proposing a neuro-symbolic framework for automated proof search. The approach integrates a fine-tuned large language model with the Isabelle interactive theorem prover and its Sledgehammer automation, and employs a semantics-aware best-first tree search to generate and filter candidate proof steps. This enables data-efficient model fine-tuning and effective pruning of the search space. Evaluated on the seL4 benchmark, the method achieves a proof success rate of 77.6%, substantially outperforming prior large language model–based approaches and standalone Sledgehammer, while generalizing well across multiple Isabelle projects.

📝 Abstract
Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for systems-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.
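The search loop the abstract describes, best-first expansion of proof states where a neural model proposes candidate steps and symbolic tools reject invalid ones and rank the survivors, can be sketched as below. This is an illustrative toy on a numeric stand-in domain, not the paper's implementation: `propose`, `apply_step`, and `score` are hypothetical placeholders for the fine-tuned LLM, the Isabelle REPL check, and the state-ranking heuristic.

```python
import heapq

def best_first_proof_search(init_state, propose, apply_step, score,
                            is_proved, budget=1000):
    """Best-first tree search over proof states.

    propose(state)        -- stands in for the LLM suggesting candidate steps
    apply_step(state, s)  -- stands in for the ITP: new state, or None if rejected
    score(state)          -- ranking heuristic; lower = more promising
    is_proved(state)      -- goal test (no remaining subgoals)
    """
    frontier = [(score(init_state), init_state, [])]
    seen = {init_state}
    while frontier and budget > 0:
        budget -= 1
        _, state, path = heapq.heappop(frontier)  # expand best state first
        if is_proved(state):
            return path
        for step in propose(state):
            nxt = apply_step(state, step)
            if nxt is None or nxt in seen:  # symbolic filtering + dedup prunes the tree
                continue
            seen.add(nxt)
            heapq.heappush(frontier, (score(nxt), nxt, path + [step]))
    return None  # search budget exhausted

# Toy instance: "prove" that 10 reduces to 0 using decrement and halving steps.
steps = {"dec": lambda n: n - 1, "halve": lambda n: n // 2}
proof = best_first_proof_search(
    10,
    propose=lambda s: list(steps),
    apply_step=lambda s, k: steps[k](s) if s > 0 else None,
    score=lambda s: s,
    is_proved=lambda s: s == 0,
)
# proof is the step sequence found, e.g. ['halve', 'halve', 'dec', 'dec']
```

Because the frontier is a priority queue keyed on `score`, a well-calibrated heuristic lets the search reach a closed proof while expanding only a small fraction of the candidate steps the proposal model generates.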
Problem

Research questions and friction points this paper is trying to address.

automated theorem proving
formal verification
large language models
proof automation
interactive theorem proving
Innovation

Methods, ideas, or system contributions that make the work stand out.

neuro-symbolic
automated theorem proving
large language models
interactive theorem proving
proof search