LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation

📅 2026-05-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
Automatically generating ACSL formal specifications for C programs remains costly and hard to automate, and existing evaluations are often compromised by data contamination and by model outputs that deceive the verifier. This work presents LiveFMBench, a continuously evolving benchmark of 630 ACSL-annotated C programs, together with a contamination-aware systematic evaluation. The study compares direct prompting at several sampling sizes, thinking-mode inference, and an agentic pipeline, validating outputs with automated provers. It finds that specification accuracy drops by approximately 20% once unfaithful outputs are excluded. Increased sampling and thinking mode both improve success rates, with smaller models benefiting more from thinking mode. Agentic pipelines help most under low sampling budgets and on harder datasets, and they notably reduce assertion errors, while incorrect loop invariants remain the dominant failure type.
📝 Abstract
Formal specification is essential for rigorous program verification, yet writing correct specifications remains costly and difficult to automate. Although large language models (LLMs) and agents have shown promising progress, their true capabilities and failure modes remain unclear. We present the first systematic and contamination-aware study of LLM- and agent-based formal specification generation for C programs. We introduce LiveFMBench, a continuously evolving benchmark of 630 ACSL (ANSI/ISO C Specification Language)-annotated C programs, including 360 newly collected cases designed to mitigate data leakage. Using this benchmark, we evaluate direct prompting with different sampling sizes, reasoning-enabled (thinking mode) inference, and an agentic pipeline, and we perform a fine-grained failure analysis. Experimental results reveal that naive evaluation substantially overestimates performance because models under direct prompting may exhibit unfaithful behaviors, such as deceiving automated provers or ignoring code-context constraints; after excluding such cases, the true specification generation accuracy drops by approximately 20%. We further find that both increased sampling and thinking mode significantly improve success rates, with smaller models benefiting more from thinking mode. Agentic pipelines are particularly effective under low sampling budgets and on harder datasets. Failure analysis further shows that incorrect loop invariants are the dominant error type, while agentic pipelines notably reduce assertion errors. These results expose fundamental limitations in current LLM-based approaches and suggest they remain far from replacing human-authored formal specifications. We release LiveFMBench at https://huggingface.co/datasets/fm-universe/Live-FM-Bench and all evaluation artifacts to support future research.
Problem

Research questions and friction points this paper is trying to address.

formal specification
large language models
agent-based workflows
ACSL
specification generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

formal specification
agentic workflows
LiveFMBench
ACSL
failure analysis