🤖 AI Summary
Large language models (LLMs) are inherently probabilistic, which creates a fundamental tension with the deterministic guarantees that formal verification demands when they are used to generate formal specifications. As a result, LLM-produced specifications carry uncertainty that is hard to quantify, making them unreliable.
Method: We propose a probabilistic context-free grammar (PCFG)-based modeling framework to characterize specification-generation failures and quantify uncertainty. We introduce a task-adapted uncertainty taxonomy and design a lightweight, multi-signal selective verification mechanism integrating syntactic entropy, SMT solvability, and model confidence.
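The selective verification mechanism can be pictured as a simple gating rule over the three signals. The sketch below is illustrative only: the function name, thresholds, and abstention policy are assumptions for exposition, not the paper's actual implementation.

```python
def selective_verify(grammar_entropy, smt_solvable, model_confidence,
                     entropy_max=1.0, confidence_min=0.7):
    """Accept an LLM-generated specification only when all three
    uncertainty signals agree; otherwise abstain and defer to a human.

    Thresholds (entropy_max, confidence_min) are illustrative
    placeholders, not values reported in the paper.
    """
    if not smt_solvable:
        return "abstain"  # the generated spec is not even SMT-satisfiable
    if grammar_entropy > entropy_max:
        return "abstain"  # high syntactic entropy: ambiguous generation
    if model_confidence < confidence_min:
        return "abstain"  # the model itself reports low confidence
    return "accept"

print(selective_verify(0.4, True, 0.9))   # low-risk case
print(selective_verify(2.1, True, 0.9))   # syntactically uncertain case
```

Abstaining rather than rejecting is what keeps the mechanism lightweight: flagged specifications are simply routed out of the automated pipeline, which is how the reported error reductions are achieved with minimal abstention.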
Contribution/Results: Our approach reduces error rates by 14–100% across benchmarks. On logical specification tasks, syntactic entropy achieves AUROC > 0.93 in predicting specification uncertainty. The framework significantly enhances both the reliability and engineering deployability of LLM-generated formal specifications, enabling principled uncertainty-aware verification.
📝 Abstract
Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals that Satisfiability Modulo Theories (SMT)-based autoformalization has a domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), while established UQ techniques, such as the entropy of token probabilities, fail to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find that uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC > 0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14–100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.
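One way to make the grammar-entropy signal concrete is to estimate, for each nonterminal of a PCFG fitted to sampled LLM outputs, the Shannon entropy of its production-rule distribution. The toy grammar and the aggregation (a plain average over nonterminals) below are assumptions for illustration; the paper's exact estimator is not reproduced here.

```python
import math

# Hypothetical PCFG estimated from sampled LLM outputs: each
# nonterminal maps to a distribution over its production rules.
pcfg = {
    "SPEC": {"SPEC -> assert EXPR": 0.9, "SPEC -> assume EXPR": 0.1},
    "EXPR": {"EXPR -> VAR cmp VAR": 0.6, "EXPR -> EXPR and EXPR": 0.4},
}

def rule_entropy(rule_probs):
    """Shannon entropy (bits) of one nonterminal's rule distribution."""
    return -sum(p * math.log2(p) for p in rule_probs.values() if p > 0)

def grammar_entropy(pcfg):
    """Average per-nonterminal rule entropy. Higher values indicate the
    model is syntactically uncertain about how to form the spec, the
    signal reported to reach AUROC > 0.93 on logical tasks."""
    return sum(rule_entropy(r) for r in pcfg.values()) / len(pcfg)

print(round(grammar_entropy(pcfg), 3))
```

A deterministic generator would put probability 1 on a single rule per nonterminal, giving entropy 0; spread-out rule probabilities, as in the toy grammar above, push the score up.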