Can Large Language Models Model Programs Formally?

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited capability of large language models (LLMs) in automatically translating programs into formal specifications suitable for model checking, a key bottleneck in their application to software verification. To systematically evaluate and advance this capability, the authors introduce Model-Bench, the first benchmark specifically designed for program-to-formal-model translation. Constructed from HumanEval, MBPP, and LiveCodeBench, Model-Bench comprises 400 Python programs paired with reference formal specifications. Leveraging a pipeline that integrates LLMs, program modeling, and model-checking techniques, the study assesses the ability of current models to generate verifiable specifications. Experimental results uncover critical limitations in existing LLMs for this task and provide clear directions for future improvements in bridging the gap between natural-language-driven code generation and formal verification.
📝 Abstract
In the digital age, ensuring the correctness, safety, and reliability of software through formal verification is paramount, particularly as software increasingly underpins critical infrastructure. Formal verification splits into theorem proving and model checking, and provides a feasible and reliable path. Unlike theorem proving, which has seen notable advances, model checking has received less attention because automatic program modeling is difficult. To fill this gap, we introduce Model-Bench, a benchmark and accompanying pipeline for evaluating and improving LLMs' program modeling capability by translating Python programs into verification-ready specifications checkable by an accompanying model checker. Model-Bench comprises 400 Python programs derived from three well-known benchmarks (HumanEval, MBPP, and LiveCodeBench). Our extensive experiments reveal significant limitations in LLMs' program modeling and point to promising directions for improvement.
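The task the benchmark targets, pairing a program with a machine-checkable specification, can be sketched in miniature. The toy example below (the function, spec, and checker names are invented for illustration; Model-Bench's actual specification language and model checker are defined in the paper) pairs a Python function with a logical specification and "model checks" it by exhaustive enumeration over a bounded input domain:

```python
def clamp(x, lo, hi):
    """Program under verification: restrict x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def spec(x, lo, hi, out):
    """Specification: the output lies in [lo, hi], and equals x
    whenever x was already inside the interval."""
    return lo <= out <= hi and (out == x or x < lo or x > hi)

def bounded_check(bound=10):
    """Toy explicit-state check: enumerate all inputs in a bounded
    domain and return the first counterexample, or None if the
    specification holds everywhere in the bound."""
    for x in range(-bound, bound + 1):
        for lo in range(-bound, bound + 1):
            for hi in range(lo, bound + 1):  # only well-formed intervals
                out = clamp(x, lo, hi)
                if not spec(x, lo, hi, out):
                    return (x, lo, hi, out)
    return None

print(bounded_check())  # → None (no counterexample within the bound)
```

A real model checker explores the program's state space symbolically or exhaustively rather than by brute-force input enumeration, but the sketch captures the core difficulty the paper studies: producing a `spec` faithful enough to the program's intent that checking it is meaningful.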
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Formal Verification
Model Checking
Program Modeling
Software Correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-Bench
Program Modeling
Formal Verification
Model Checking
Large Language Models