OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification

📅 2025-04-29
🤖 AI Summary
This work systematically evaluates the capability of large language models (LLMs) to generate formal specification code for operating system (OS) kernel verification. To this end, the authors introduce OSVBench, the first long-context benchmark tailored to kernel verification, comprising 245 tasks (20k–30k tokens each) derived from the real-world Hyperkernel system. Specification generation is formulated as a program synthesis problem within a confined syntactic and semantic search space: given the programming model, a verification assumption, and a high-level functional description of the OS, a model must produce a complete specification for a potentially buggy kernel implementation. Evaluation of 12 state-of-the-art LLMs yields a low average accuracy of only 17.8%, exposing fundamental limitations in long-range dependency modeling, cross-abstraction reasoning, and the preservation of formal semantic consistency. All benchmark data, prompts, and evaluation tooling are publicly released.

📝 Abstract
We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) on generating complete specification code for operating system kernel verification tasks. The benchmark casts the specification generation problem as a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. An LLM must understand the given verification assumption and the syntax and semantics space to search, then generate the complete specification for a potentially buggy operating system implementation, guided by a high-level functional description of the operating system. The benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each a long-context task of about 20k-30k tokens. Our comprehensive evaluation of 12 LLMs shows the limited performance of current LLMs on specification generation tasks for operating system verification, and significant disparities in their performance highlight differences in their ability to handle long-context code generation tasks. The evaluation toolkit and benchmark are available at https://github.com/lishangyu-hkust/OSVBench.
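To make the task framing concrete, the sketch below illustrates (in plain Python, not code from the benchmark) what a state-machine specification for a kernel operation looks like and how it is checked against an implementation. All names here (`KernelState`, `spec_alloc_page`, `impl_alloc_page`, `sys_alloc_page`) are hypothetical; Hyperkernel's actual specifications are Python-embedded SMT formulas verified with Z3, whereas this toy version checks a tiny state space exhaustively.

```python
# Illustrative sketch only: specification generation for OS verification,
# framed as in OSVBench, asks a model to emit a declarative spec that a
# verifier checks against a (potentially buggy) kernel implementation.
from dataclasses import dataclass

PID_IDLE = 0   # hypothetical convention: free pages are owned by "idle"
NPAGES = 8     # tiny physical page table, small enough to check exhaustively

@dataclass(frozen=True)
class KernelState:
    # abstract kernel state: which process owns each physical page
    page_owner: tuple  # page_owner[i] == pid that owns page i

def spec_alloc_page(old: KernelState, new: KernelState,
                    pid: int, page: int) -> bool:
    """Declarative spec for a hypothetical sys_alloc_page syscall:
    if the page is free, it becomes owned by pid and nothing else
    changes; otherwise the call must leave the state untouched."""
    if old.page_owner[page] != PID_IDLE:
        return new == old  # precondition failed: no state change allowed
    return (new.page_owner[page] == pid and
            all(new.page_owner[i] == old.page_owner[i]
                for i in range(NPAGES) if i != page))

def impl_alloc_page(state: KernelState, pid: int, page: int) -> KernelState:
    """A candidate implementation to be checked against the spec."""
    if state.page_owner[page] != PID_IDLE:
        return state
    owners = list(state.page_owner)
    owners[page] = pid
    return KernelState(tuple(owners))

# Exhaustive checking over this toy state space stands in for the
# SMT-based verification used on the real kernel.
s0 = KernelState((PID_IDLE,) * NPAGES)
for pid in (1, 2):
    for page in range(NPAGES):
        s1 = impl_alloc_page(s0, pid, page)
        assert spec_alloc_page(s0, s1, pid, page)
print("spec holds on all checked transitions")
```

In the benchmark setting, the model is given the programming model and functional description and must synthesize the analogue of `spec_alloc_page`; the verifier then decides whether the implementation satisfies it.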
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on OS kernel specification generation tasks
Defining specification generation as constrained program synthesis
Assessing LLM performance on long-context OS verification challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defines specification generation as program synthesis
Uses real-world OS kernel Hyperkernel for tasks
Evaluates 12 LLMs on long-context code generation