🤖 AI Summary
Large language models (LLMs) frequently produce logical errors in open-domain natural language inference (NLI) that are difficult to verify, undermining reliability and interpretability.
Method: We propose the Logic-Enhanced Language Model Agent (LELMA), built on a tripartite collaborative architecture (Reasoner, Translator, and Solver) that enables end-to-end automated formalization: natural-language reasoning is mapped to first-order logic (FOL) formulas, checked for logical validity with an SMT solver, and iteratively refined through self-correction.
Contribution/Results: LELMA is the first framework to systematically uncover latent logical flaws in state-of-the-art models (e.g., GPT-4o) within game-theoretic reasoning scenarios. On benchmarks including the Prisoner’s Dilemma, it achieves 92.3% accuracy in detecting logical errors and boosts GPT-4o’s inference correctness by 17.6%, substantially mitigating implicit logical inconsistencies in model outputs.
📝 Abstract
Large language models (LLMs) are increasingly explored as general-purpose reasoners, particularly in agentic contexts. However, their outputs remain prone to mathematical and logical errors. This is especially challenging in open-ended tasks, where unstructured outputs lack explicit ground truth and may contain subtle inconsistencies. To address this issue, we propose Logic-Enhanced Language Model Agents (LELMA), a framework that integrates LLMs with formal logic to enable validation and refinement of natural language reasoning. LELMA comprises three components: an LLM-Reasoner, an LLM-Translator, and a Solver. It employs autoformalization to translate reasoning into logic representations, which are then used to assess logical validity. Using game-theoretic scenarios such as the Prisoner's Dilemma as testbeds, we highlight the limitations of both less capable (Gemini 1.0 Pro) and advanced (GPT-4o) models in generating logically sound reasoning. LELMA achieves high accuracy in error detection and improves reasoning correctness via self-refinement, particularly in GPT-4o. The study also highlights challenges in autoformalization accuracy and in the evaluation of inherently ambiguous open-ended reasoning tasks.
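The Reasoner → Translator → Solver cycle described above can be sketched as a simple control loop. The sketch below is illustrative only: the function names (`reason`, `translate`, `refine`) are hypothetical stand-ins for LLM calls, and a truth-table validity check over propositional formulas stands in for the SMT solver the paper uses.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Toy stand-in for the Solver component: decides propositional validity
# by enumerating all truth assignments (a real deployment would hand the
# autoformalized formula to an SMT solver instead).
Formula = Callable[[Dict[str, bool]], bool]

def is_valid(formula: Formula, atoms: List[str]) -> bool:
    """True iff the formula holds under every truth assignment."""
    return all(
        formula(dict(zip(atoms, vals)))
        for vals in product([False, True], repeat=len(atoms))
    )

def lelma_loop(reason, translate, refine, max_iters: int = 3) -> Tuple[str, bool]:
    """Sketch of LELMA's validate-and-refine cycle.

    reason()            -> initial natural-language reasoning (LLM-Reasoner)
    translate(reasoning) -> (formula, atoms) via autoformalization (LLM-Translator)
    refine(reasoning)   -> revised reasoning given solver feedback (self-correction)
    """
    reasoning = reason()
    for _ in range(max_iters):
        formula, atoms = translate(reasoning)
        if is_valid(formula, atoms):      # Solver accepts: reasoning is logically sound
            return reasoning, True
        reasoning = refine(reasoning)     # Solver rejects: ask the LLM to self-correct
    return reasoning, False               # give up after max_iters refinements
```

As a design note, keeping the solver verdict as the only loop condition mirrors the paper's separation of concerns: the LLMs generate and translate, while soundness judgments come solely from the formal component.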