🤖 AI Summary
Large language models (LLMs) lack reliable mechanisms to verify that their outputs satisfy required constraints: sampling-based estimates give only empirical intuition and provide no sound probabilistic guarantees.
Method: We propose the first practical, deterministic verification framework for LLM output constraints, yielding theoretically grounded, tight upper bounds on violation probabilities, replacing unreliable sampling. Our approach combines prefix-closed semantic constraint modeling, systematic search over the generation space, and novel data structures (token tries and frontier sets) that maintain provably sound probability bounds at each decoding step.
Results: Evaluated on multiple state-of-the-art LLMs, our framework achieves 6–8× tighter probability bounds than prior methods and identifies 3–4× more high-risk instances than baselines under identical computational budgets. It enables precise risk assessment for correctness, privacy, and safety, supporting rigorous, computationally efficient certification of constrained LLM behavior.
📝 Abstract
As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify that model outputs satisfy required constraints. While sampling-based estimates provide an intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction. Given any prefix-closed semantic constraint, BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification, and secure code generation tasks across multiple state-of-the-art LLMs. BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high-risk instances compared to baseline methods under identical computational budgets, enabling precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide.