VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of ensuring logical correctness in large language models (LLMs) deployed in high-stakes domains by proposing a neurosymbolic system that integrates LLMs with SMT solvers to iteratively refine and verify generated answers. The approach decomposes model outputs into atomic propositions, automatically formalizes them into first-order logic, and employs automated theorem proving to validate their consistency. Key innovations include a multi-model consensus mechanism based on formal semantic equivalence, a semantic routing strategy tailored to proposition types, and precise logical error localization grounded in Minimal Correction Subsets (MCS). Experiments show that with the GPT-OSS-120B model, the framework improves average performance by 18.7% over single-pass generation across multiple reasoning benchmarks, substantially enhancing the logical reliability of generated answers.
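The consensus mechanism compares candidates at the logic level rather than by surface form. As a rough illustration only (the paper formalizes claims into first-order logic and uses SMT solvers; this sketch uses hypothetical propositional formulas and brute-force truth-table enumeration), two syntactically different candidates can be checked for semantic equivalence:

```python
from itertools import product

# Two hypothetical candidate claims formalized over propositions (p, q).
# Their surface forms differ -- "not (p and not q)" vs "p implies q" --
# which is exactly what string-similarity metrics would miss.
cand_a = lambda p, q: not (p and not q)
cand_b = lambda p, q: (not p) or q          # material implication p -> q

def equivalent(f, g):
    """Semantic equivalence: the formulas agree on every assignment."""
    return all(f(p, q) == g(p, q) for p, q in product([False, True], repeat=2))

print(equivalent(cand_a, cand_b))  # -> True
```

In a real SMT setting the same check is done by asking the solver whether the negation of the biconditional is unsatisfiable, which scales beyond enumerable assignments.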

📝 Abstract
Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers to produce verification-guided answers through iterative refinement. Our approach decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies their logical consistency using automated theorem proving. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking to ensure logic-level alignment between candidates, eliminating the syntactic bias of surface-form metrics, (2) semantic routing that directs different claim types to appropriate verification strategies: symbolic solvers for logical claims and LLM ensembles for commonsense reasoning, and (3) precise logical error localization via Minimal Correction Subsets (MCS), which pinpoint the exact subset of claims to revise, transforming binary failure signals into actionable feedback. Our framework classifies claims by their logical status and aggregates multiple verification signals into a unified score with variance-based penalty. The system iteratively refines answers using structured feedback until acceptance criteria are met or convergence is achieved. This hybrid approach delivers formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
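The MCS-based error localization described in the abstract can be sketched as follows. This is a minimal propositional toy, not the paper's implementation (which autoformalizes claims into first-order logic and uses an SMT solver); the claim names and formulas are hypothetical. The idea: grow a maximal satisfiable subset of claims, and report its complement as the set to revise.

```python
from itertools import product

# Hypothetical atomic claims, each a predicate over a truth assignment.
claims = {
    "c1: p":       lambda a: a["p"],
    "c2: p -> q":  lambda a: (not a["p"]) or a["q"],
    "c3: not q":   lambda a: not a["q"],   # jointly inconsistent with c1, c2
}

def satisfiable(names):
    """Brute-force SAT over the two propositions p and q."""
    for p, q in product([False, True], repeat=2):
        a = {"p": p, "q": q}
        if all(claims[n](a) for n in names):
            return True
    return False

def minimal_correction_set(names):
    """Grow a maximal satisfiable subset (MSS) in one pass; any claim
    rejected against a prefix stays inconsistent with the final superset,
    so the complement of the MSS is a Minimal Correction Subset."""
    mss = []
    for n in names:
        if satisfiable(mss + [n]):
            mss.append(n)
    return [n for n in names if n not in mss]

print(minimal_correction_set(list(claims)))  # -> ['c3: not q']
```

The returned subset pinpoints exactly which claims to revise, turning a bare "inconsistent" verdict into the actionable feedback the refinement loop consumes.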
Problem

Research questions and friction points this paper is trying to address.

logical correctness
Large Language Models
verifiable reasoning
high-stakes domains
formal verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

neurosymbolic reasoning
formal verification
semantic routing
Minimal Correction Subsets
iterative refinement
Vikash Singh
Case Western Reserve University
Darion Cassel
Amazon Web Services
Nathaniel Weir
Johns Hopkins University
Natural Language Processing · Artificial Intelligence · Linguistics
Nick Feng
University of Toronto
Software Engineering · Verification
Sam Bayless
Amazon Web Services