Agentic Model Checking

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenge of verifying systems code generated by large language models (LLMs), which often lacks formal specifications and encodes safety contracts implicitly at call sites. The authors propose a verification paradigm in which an LLM agent handles tasks requiring semantic judgment, such as specification inference, check selection, and counterexample classification, while a bounded model checker (BMC) makes every soundness-relevant decision. The approach combines top-down specification inference, compositional verification, and a counterexample validation pipeline. It relies on specifications encoded in a restricted domain-specific language, function-level isolation, postcondition-constrained callee stubs, and dynamic counterexample replay, which together allow refined specifications to propagate automatically to callers and defects to be classified precisely. Evaluated on LLM-generated kernel and compiler code in C and Rust, as well as on mature OSS-Fuzz-hardened libraries, the method confirms real bugs, produces bounded clean verifications of heavily fuzzed interfaces, and establishes functional equivalence for selected algorithmic functions.
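The idea of encoding specifications in a constrained DSL that lowers mechanically to a checker's primitives can be illustrated with a small translator. This is a sketch under stated assumptions, not BMC-Agent's actual DSL. The two-clause grammar (`requires <expr>` / `ensures <expr>`), the `translate` function, and the harness layout are all hypothetical. The output targets CBMC-style C, where `__CPROVER_assume` restricts the inputs, `assert` marks the property to check, and a body-less `nondet_int` follows the common CBMC convention for producing a nondeterministic value.

```python
import re
from typing import List

# Hypothetical restricted clause grammar: "requires <expr>" or "ensures <expr>", where
# <expr> may use only identifiers, integer literals, parentheses, arithmetic and
# comparison operators, and && / ||. Anything else (indexing, calls, casts) is rejected.
CLAUSE = re.compile(r"^(requires|ensures)\s+([A-Za-z0-9_ ()+\-*<>=!&|]+)$")


def translate(fn: str, params: List[str], clauses: List[str]) -> str:
    """Deterministically lower spec clauses for `fn` into a CBMC-style C harness."""
    pre: List[str] = []
    post: List[str] = []
    for clause in clauses:
        m = CLAUSE.match(clause.strip())
        if m is None:
            raise ValueError(f"clause outside the restricted DSL: {clause!r}")
        (pre if m.group(1) == "requires" else post).append(m.group(2).strip())
    int_params = ", ".join("int" for _ in params) or "void"
    lines = [
        "#include <assert.h>",
        "int nondet_int(void);  /* no body: CBMC treats the result as nondeterministic */",
        f"int {fn}({int_params});",
        "void harness(void) {",
    ]
    lines += [f"  int {p} = nondet_int();" for p in params]
    lines += [f"  __CPROVER_assume({e});" for e in pre]  # requires -> assume
    lines.append(f"  int ret = {fn}({', '.join(params)});")
    lines += [f"  assert({e});" for e in post]  # ensures -> assert
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(translate("clamp", ["i", "n"],
                    ["requires n > 0 && i >= 0", "ensures ret >= 0 && ret < n"]))
```

Each `requires` clause becomes an assumption on the nondeterministic inputs, and each `ensures` clause becomes an assertion on `ret`. A clause such as `requires x[0] > 0` raises `ValueError` because indexing falls outside the hypothetical grammar, so the translator never has to guess at its meaning.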
📝 Abstract
Verifying LLM-generated systems code is hard: bugs are prevalent, formal specifications are missing, and safety contracts are encoded implicitly at call sites rather than enforced at function boundaries. We propose agentic model checking, a paradigm that couples LLM agents with a bounded model checking (BMC) backend under the principle "agents propose, solvers verify": agents handle tasks requiring semantic judgment (spec inference, check selection, counterexample classification, refinement proposal), while BMC discharges every soundness-relevant decision. The paradigm rests on three commitments. First, specifications are inferred top-down from caller context in a restricted DSL that translates deterministically into the backend's assume/assert primitives, with optional functional-correctness clauses lifting verification from panic-freeness to behavioural faithfulness. Second, verification is compositional: each function is checked in isolation against its spec, with callees replaced by postcondition-constrained stubs, so per-query cost scales with a single function's state space and refinements propagate automatically to callers. Third, counterexamples are not bug reports: they pass through a validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) that distinguishes active in-tree crashes from latent public-API failures, while modelling artifacts drive a refinement loop rather than being suppressed. We instantiate the approach in BMC-Agent and evaluate it on LLM-generated kernel and compiler code in C and Rust alongside mature OSS-Fuzz-hardened libraries, confirming real defects, producing bounded clean verifications on heavily fuzzed surfaces, and establishing functional equivalence on selected algorithmic functions.
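A toy model can make the compositional check and the replay step concrete. The Python sketch below rests on several assumptions. Exhaustive enumeration over a tiny bound stands in for the symbolic search of a real BMC backend, specs are plain predicates rather than the paper's DSL, only the dynamic-replay stage of the validation pipeline is modelled, and `clamp`, `safe_index`, `next_index`, and the spec constants are hypothetical. Each caller is checked in isolation, with its callee replaced by a stub that may return any value the callee's postcondition allows. Any counterexample is then replayed against the concrete callee. With a weak callee postcondition, the check reports a spurious counterexample (a modelling artifact). A refined postcondition, first checked against `clamp` itself, removes that artifact, while a genuine off-by-one survives replay.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumption: a tiny bounded domain stands in for a real BMC backend's symbolic search.
BOUND = range(0, 6)
Cex = Tuple[int, int, int]  # (n, i, value returned by the stub)


@dataclass
class Spec:
    requires: Callable[..., bool]  # assumed at entry, asserted at call sites
    ensures: Callable[..., bool]   # ensures(ret, *args): asserted at exit, assumed by stubs


class Infeasible(Exception):
    """The stub's choice violates the callee postcondition, so the path is pruned."""


class CalleePreconditionViolated(Exception):
    """The caller invoked the callee outside the callee's precondition."""


def clamp(i: int, n: int) -> int:
    """Hypothetical concrete callee: returns min(i, n - 1)."""
    return min(i, n - 1)


def safe_index(n: int, i: int, callee: Callable[[int, int], int]) -> int:
    """Hypothetical caller: its result must not exceed i."""
    return callee(i, n)


def next_index(n: int, i: int, callee: Callable[[int, int], int]) -> int:
    """Hypothetical caller with an off-by-one: its result must stay below n."""
    return callee(i, n) + 1


def make_stub(spec: Spec, choice: int) -> Callable[..., int]:
    """Postcondition-constrained stub that returns `choice` in place of the callee."""
    def stub(*args: int) -> int:
        if not spec.requires(*args):
            raise CalleePreconditionViolated(args)
        if not spec.ensures(choice, *args):
            raise Infeasible
        return choice
    return stub


def check(caller, caller_spec: Spec, callee_spec: Spec) -> Optional[Cex]:
    """Check `caller` in isolation; return the first counterexample found, if any."""
    for n in BOUND:
        for i in BOUND:
            if not caller_spec.requires(n, i):
                continue  # assume(requires): skip inputs outside the precondition
            for choice in BOUND:  # every value the stub could return
                try:
                    ret = caller(n, i, make_stub(callee_spec, choice))
                except Infeasible:
                    continue
                except CalleePreconditionViolated:
                    return (n, i, choice)  # call-site contract broken
                if not caller_spec.ensures(ret, n, i):
                    return (n, i, choice)  # postcondition violated
    return None


def check_callee(spec: Spec) -> bool:
    """Check the concrete callee against its own spec over the same bound."""
    return all(spec.ensures(clamp(i, n), i, n)
               for n in BOUND for i in BOUND if spec.requires(i, n))


def replay(caller, caller_spec: Spec, cex: Cex) -> str:
    """Dynamic replay: rerun the counterexample inputs with the concrete callee."""
    n, i, _ = cex
    ret = caller(n, i, clamp)
    return "confirmed bug" if not caller_spec.ensures(ret, n, i) else "modelling artifact"


SAFE_SPEC = Spec(lambda n, i: n > 0 and i >= 0, lambda r, n, i: r <= i)
NEXT_SPEC = Spec(lambda n, i: n > 0 and i >= 0, lambda r, n, i: r < n)
WEAK_CLAMP = Spec(lambda i, n: n > 0 and i >= 0, lambda r, i, n: 0 <= r < n)
STRONG_CLAMP = Spec(lambda i, n: n > 0 and i >= 0,
                    lambda r, i, n: 0 <= r < n and r <= i)  # refined postcondition

if __name__ == "__main__":
    cex = check(safe_index, SAFE_SPEC, WEAK_CLAMP)
    print(cex, replay(safe_index, SAFE_SPEC, cex))     # (2, 0, 1) modelling artifact
    print(check_callee(STRONG_CLAMP))                  # True: refinement holds for clamp
    print(check(safe_index, SAFE_SPEC, STRONG_CLAMP))  # None: bounded clean
    cex = check(next_index, NEXT_SPEC, STRONG_CLAMP)
    print(cex, replay(next_index, NEXT_SPEC, cex))     # (1, 0, 0) confirmed bug
```

In this toy model, the refined `STRONG_CLAMP` is validated once against `clamp`'s body and then reused as the stub for every caller. That reuse is a small-scale analogue of a refinement propagating to callers without re-analysing the callee, and it keeps each `check` call limited to one function's inputs plus the stub's return values.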
Problem

Research questions and friction points this paper is trying to address.

LLM-generated code
formal verification
specification inference
bounded model checking
systems code
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic model checking
bounded model checking
specification inference
compositional verification
counterexample validation