🤖 AI Summary
Large language models (LLMs) are inherently probabilistic, which creates a fundamental tension with the deterministic guarantees that formal verification demands when they are used to generate formal specifications. As a result, LLM-produced specifications carry uncertainty that is hard to quantify, making them unreliable.
Method: We propose a probabilistic context-free grammar (PCFG)-based modeling framework to characterize specification-generation failures and quantify uncertainty. We introduce a task-adapted uncertainty taxonomy and design a lightweight, multi-signal selective verification mechanism integrating syntactic entropy, SMT solvability, and model confidence.
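The selective verification mechanism can be pictured as a simple gating rule over the three signals. The sketch below is illustrative only: the function name, thresholds, and abstention policy are assumptions for exposition, not the paper's actual implementation.

```python
def selective_verify(grammar_entropy, smt_solvable, model_confidence,
                     entropy_max=1.0, confidence_min=0.7):
    """Accept an LLM-generated specification only when all three
    uncertainty signals agree; otherwise abstain and defer to a human.

    Thresholds (entropy_max, confidence_min) are illustrative
    placeholders, not values reported in the paper.
    """
    if not smt_solvable:
        return "abstain"  # the generated spec is not even SMT-satisfiable
    if grammar_entropy > entropy_max:
        return "abstain"  # high syntactic entropy: ambiguous generation
    if model_confidence < confidence_min:
        return "abstain"  # the model itself reports low confidence
    return "accept"

print(selective_verify(0.4, True, 0.9))   # low-risk case
print(selective_verify(2.1, True, 0.9))   # syntactically uncertain case
```

Abstaining rather than rejecting is what keeps the mechanism lightweight: flagged specifications are simply routed out of the automated pipeline, which is how the reported error reductions are achieved with minimal abstention.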
Contribution/Results: Our approach reduces error rates by 14–100% across benchmarks. On logical specification tasks, syntactic entropy achieves AUROC > 0.93 in predicting specification uncertainty. The framework significantly enhances both the reliability and engineering deployability of LLM-generated formal specifications, enabling principled uncertainty-aware verification.
📝 Abstract
Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals that Satisfiability Modulo Theories (SMT)-based autoformalization has a domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), while established UQ techniques, such as the entropy of token probabilities, fail to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find that uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC > 0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14–100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.
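One way to make the grammar-entropy signal concrete is to estimate, for each nonterminal of a PCFG fitted to sampled LLM outputs, the Shannon entropy of its production-rule distribution. The toy grammar and the aggregation (a plain average over nonterminals) below are assumptions for illustration; the paper's exact estimator is not reproduced here.

```python
import math

# Hypothetical PCFG estimated from sampled LLM outputs: each
# nonterminal maps to a distribution over its production rules.
pcfg = {
    "SPEC": {"SPEC -> assert EXPR": 0.9, "SPEC -> assume EXPR": 0.1},
    "EXPR": {"EXPR -> VAR cmp VAR": 0.6, "EXPR -> EXPR and EXPR": 0.4},
}

def rule_entropy(rule_probs):
    """Shannon entropy (bits) of one nonterminal's rule distribution."""
    return -sum(p * math.log2(p) for p in rule_probs.values() if p > 0)

def grammar_entropy(pcfg):
    """Average per-nonterminal rule entropy. Higher values indicate the
    model is syntactically uncertain about how to form the spec, the
    signal reported to reach AUROC > 0.93 on logical tasks."""
    return sum(rule_entropy(r) for r in pcfg.values()) / len(pcfg)

print(round(grammar_entropy(pcfg), 3))
```

A deterministic generator would put probability 1 on a single rule per nonterminal, giving entropy 0; spread-out rule probabilities, as in the toy grammar above, push the score up.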