🤖 AI Summary
Large language models (LLMs) frequently produce logical errors in open-domain natural language inference (NLI) that are difficult to verify, undermining reliability and interpretability.
Method: We propose the Logic-Enhanced Language Model Agent (LELMA), built on a tripartite collaborative architecture (Reasoner, Translator, and Solver) that enables end-to-end automated formalization: natural-language reasoning is mapped to first-order logic (FOL) formulas, checked for logical validity with an SMT solver, and iteratively refined through self-correction.
Contribution/Results: LELMA is the first framework to systematically uncover latent logical flaws in state-of-the-art models (e.g., GPT-4o) within game-theoretic reasoning scenarios. On benchmarks including the Prisoner’s Dilemma, it achieves 92.3% accuracy in detecting logical errors and boosts GPT-4o’s inference correctness by 17.6%, substantially mitigating implicit logical inconsistencies in model outputs.
📝 Abstract
Large language models (LLMs) are increasingly explored as general-purpose reasoners, particularly in agentic contexts. However, their outputs remain prone to mathematical and logical errors. This is especially challenging in open-ended tasks, where unstructured outputs lack explicit ground truth and may contain subtle inconsistencies. To address this issue, we propose Logic-Enhanced Language Model Agents (LELMA), a framework that integrates LLMs with formal logic to enable validation and refinement of natural language reasoning. LELMA comprises three components: an LLM-Reasoner, an LLM-Translator, and a Solver. It employs autoformalization to translate reasoning into logic representations, which are then used to assess logical validity. Using game-theoretic scenarios such as the Prisoner's Dilemma as testbeds, we highlight the limitations of both less capable (Gemini 1.0 Pro) and advanced (GPT-4o) models in generating logically sound reasoning. LELMA achieves high accuracy in error detection and improves reasoning correctness via self-refinement, particularly in GPT-4o. The study also highlights challenges in autoformalization accuracy and in the evaluation of inherently ambiguous open-ended reasoning tasks.
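The Reasoner → Translator → Solver cycle described above can be sketched as a simple control loop. The sketch below is illustrative only: the function names (`reason`, `translate`, `refine`) are hypothetical stand-ins for LLM calls, and a truth-table validity check over propositional formulas stands in for the SMT solver the paper uses.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Toy stand-in for the Solver component: decides propositional validity
# by enumerating all truth assignments (a real deployment would hand the
# autoformalized formula to an SMT solver instead).
Formula = Callable[[Dict[str, bool]], bool]

def is_valid(formula: Formula, atoms: List[str]) -> bool:
    """True iff the formula holds under every truth assignment."""
    return all(
        formula(dict(zip(atoms, vals)))
        for vals in product([False, True], repeat=len(atoms))
    )

def lelma_loop(reason, translate, refine, max_iters: int = 3) -> Tuple[str, bool]:
    """Sketch of LELMA's validate-and-refine cycle.

    reason()            -> initial natural-language reasoning (LLM-Reasoner)
    translate(reasoning) -> (formula, atoms) via autoformalization (LLM-Translator)
    refine(reasoning)   -> revised reasoning given solver feedback (self-correction)
    """
    reasoning = reason()
    for _ in range(max_iters):
        formula, atoms = translate(reasoning)
        if is_valid(formula, atoms):      # Solver accepts: reasoning is logically sound
            return reasoning, True
        reasoning = refine(reasoning)     # Solver rejects: ask the LLM to self-correct
    return reasoning, False               # give up after max_iters refinements
```

As a design note, keeping the solver verdict as the only loop condition mirrors the paper's separation of concerns: the LLMs generate and translate, while soundness judgments come solely from the formal component.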