Can Large Language Models Model Programs Formally?

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited capability of large language models (LLMs) in automatically translating programs into formal specifications suitable for model checking, a key bottleneck in their application to software verification. To systematically evaluate and advance this capability, the authors introduce Model-Bench, the first benchmark specifically designed for program-to-formal-model translation. Constructed from HumanEval, MBPP, and LiveCodeBench, Model-Bench comprises 400 Python programs paired with reference formal specifications. Leveraging a pipeline that integrates LLMs, program modeling, and model-checking techniques, the study assesses the ability of current models to generate verifiable specifications. Experimental results uncover critical limitations in existing LLMs for this task and provide clear directions for future improvements in bridging the gap between natural-language-driven code generation and formal verification.
📝 Abstract
In the digital age, ensuring the correctness, safety, and reliability of software through formal verification is paramount, particularly as software increasingly underpins critical infrastructure. Formal verification splits into theorem proving and model checking, and provides a feasible and reliable path. Unlike theorem proving, which has seen notable advances, model checking has received less attention because automatic program modeling is difficult. To fill this gap, we introduce Model-Bench, a benchmark and accompanying pipeline for evaluating and improving LLMs' program modeling capability by translating Python programs into verification-ready specifications checkable by an accompanying model checker. Model-Bench comprises 400 Python programs derived from three well-known benchmarks (HumanEval, MBPP, and LiveCodeBench). Our extensive experiments reveal significant limitations in LLMs' program modeling and point to promising directions for improvement.
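The task the benchmark targets, pairing a program with a machine-checkable specification, can be sketched in miniature. The toy example below (the function, spec, and checker names are invented for illustration; Model-Bench's actual specification language and model checker are defined in the paper) pairs a Python function with a logical specification and "model checks" it by exhaustive enumeration over a bounded input domain:

```python
def clamp(x, lo, hi):
    """Program under verification: restrict x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def spec(x, lo, hi, out):
    """Specification: the output lies in [lo, hi], and equals x
    whenever x was already inside the interval."""
    return lo <= out <= hi and (out == x or x < lo or x > hi)

def bounded_check(bound=10):
    """Toy explicit-state check: enumerate all inputs in a bounded
    domain and return the first counterexample, or None if the
    specification holds everywhere in the bound."""
    for x in range(-bound, bound + 1):
        for lo in range(-bound, bound + 1):
            for hi in range(lo, bound + 1):  # only well-formed intervals
                out = clamp(x, lo, hi)
                if not spec(x, lo, hi, out):
                    return (x, lo, hi, out)
    return None

print(bounded_check())  # → None (no counterexample within the bound)
```

A real model checker explores the program's state space symbolically or exhaustively rather than by brute-force input enumeration, but the sketch captures the core difficulty the paper studies: producing a `spec` faithful enough to the program's intent that checking it is meaningful.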
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Formal Verification
Model Checking
Program Modeling
Software Correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-Bench
Program Modeling
Formal Verification
Model Checking
Large Language Models