🤖 AI Summary
This work addresses the challenge of verifying systems code generated by large language models (LLMs), which typically lacks formal specifications and encodes safety contracts implicitly at call sites. The authors propose agentic model checking, a paradigm in which LLM agents handle tasks that require semantic judgment (specification inference, check selection, counterexample classification, and refinement proposal) while a bounded model checker makes every soundness-relevant decision. The approach combines three elements: top-down specification inference from caller context, expressed in a restricted domain-specific language; compositional verification, in which each function is checked in isolation with its callees replaced by postcondition-constrained stubs, so that refinements propagate automatically to callers; and a counterexample validation pipeline, including dynamic replay, that separates real defects from modelling artifacts. Evaluated on LLM-generated kernel and compiler code in C and Rust and on mature OSS-Fuzz-hardened libraries, the resulting tool, BMC-Agent, confirms real defects, produces bounded clean verifications on heavily fuzzed surfaces, and establishes functional equivalence for selected algorithmic functions.
📝 Abstract
Verifying LLM-generated systems code is hard: bugs are prevalent, formal specifications are missing, and safety contracts are encoded implicitly at call sites rather than enforced at function boundaries. We propose agentic model checking, a paradigm that couples LLM agents with a bounded model checking (BMC) backend under the principle "agents propose, solvers verify": agents handle tasks requiring semantic judgment (spec inference, check selection, counterexample classification, refinement proposal) while BMC discharges every soundness-relevant decision. The paradigm rests on three commitments. Specifications are inferred top-down from caller context in a restricted DSL that translates deterministically into the backend's assume/assert primitives, with optional functional-correctness clauses lifting verification from panic-freeness to behavioural faithfulness. Verification is compositional: each function is checked in isolation against its spec with callees replaced by postcondition-constrained stubs, so per-query cost scales with a single function's state space and refinements propagate automatically to callers. Counterexamples are not bug reports: they pass through a validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) that distinguishes active in-tree crashes from latent public-API failures, while modelling artifacts drive a refinement loop rather than being suppressed. We instantiate the approach in BMC-Agent and evaluate it on LLM-generated kernel and compiler code in C and Rust alongside mature OSS-Fuzz-hardened libraries, confirming real defects, producing bounded clean verifications on heavily fuzzed surfaces, and establishing functional equivalence on selected algorithmic functions.
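The compositional step described in the abstract can be illustrated with a small Python sketch. It is not BMC-Agent's DSL or API: the `Spec` class, the `assume`/`check` helpers, and the `scaled_index` example are hypothetical, and exhaustive enumeration over a tiny integer domain stands in for a real bounded model checker. The sketch checks one function against its spec while its callee is replaced by a stub that may return any value its postcondition allows.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable


@dataclass(frozen=True)
class Spec:
    """Hypothetical contract: pre(*args) and post(result, *args)."""
    pre: Callable[..., bool]
    post: Callable[..., bool]


class Infeasible(Exception):
    """A failed 'assume': the path is pruned, not reported."""


class Violation(Exception):
    """A failed 'assert': the path is a counterexample."""


def assume(cond):
    if not cond:
        raise Infeasible()


def check(cond, msg):
    if not cond:
        raise Violation(msg)


def make_stub(callee_spec, choice):
    """Postcondition-constrained stub returning one candidate value."""
    def stub(*args):
        # The caller must establish the callee's precondition.
        check(callee_spec.pre(*args), "callee precondition not established")
        # Only results permitted by the callee's postcondition are explored.
        assume(callee_spec.post(choice, *args))
        return choice
    return stub


def verify(func, spec, callee_spec, arity, domain):
    """Return the first counterexample found within the bound, else None."""
    for args in product(domain, repeat=arity):
        if not spec.pre(*args):
            continue  # inputs outside the precondition are assumed away
        for choice in domain:
            try:
                result = func(make_stub(callee_spec, choice), *args)
                check(spec.post(result, *args), "postcondition violated")
            except Infeasible:
                continue
            except Violation as v:
                return (args, choice, str(v))
    return None


# Hypothetical callee contract: clamp(i, n) needs n > 0 and returns 0 <= r < n.
clamp_spec = Spec(pre=lambda i, n: n > 0, post=lambda r, i, n: 0 <= r < n)


def scaled_index(clamp, i, n):
    """Function under verification; its callee is passed in as a stub."""
    return 2 * clamp(i, n)


DOMAIN = range(-2, 4)
good_spec = Spec(pre=lambda i, n: n > 0, post=lambda r, i, n: 0 <= r < 2 * n)
weak_spec = Spec(pre=lambda i, n: True, post=good_spec.post)

print(verify(scaled_index, good_spec, clamp_spec, 2, DOMAIN))
# → None
print(verify(scaled_index, weak_spec, clamp_spec, 2, DOMAIN))
# → ((-2, -2), -2, 'callee precondition not established')
```

With the weak spec, the counterexample points at a missing caller-side precondition rather than a defect in the function body. In the paradigm described above, this is the kind of result that would go through validation and could prompt a refinement (here, adding `n > 0` to the caller's spec) instead of being reported directly as a bug.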