Logical Consistency of Large Language Models in Fact-checking

📅 2024-12-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
This work addresses the critical issue of logical inconsistency in large language models (LLMs) when performing knowledge graph (KG)-augmented propositional logic fact-checking—particularly under complex logical queries involving negation, conjunction, and disjunction. To tackle this challenge, we propose a systematic solution comprising three key components: (1) the first curated benchmark suite explicitly designed to evaluate logical consistency across three categories of propositional logic queries; (2) a novel consistency metric for assessing LLM responses to formalized propositional logic queries; and (3) an integrated methodology combining retrieval-augmented generation (RAG), propositional logic formalization, KG embedding, and supervised fine-tuning to enhance reasoning stability. Empirical evaluation reveals severe logical inconsistency in state-of-the-art LLMs; our fine-tuned models achieve a 32.7% absolute improvement in consistency. All code and benchmarks are publicly released.
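The summary's consistency metric can be illustrated with a minimal sketch. The function names (`llm_verdict`, `negation_consistency`), the claim encoding, and the toy fact table are all illustrative assumptions, not the paper's actual implementation; `llm_verdict` stands in for a real LLM fact-checker.

```python
# Minimal sketch of a propositional-logic consistency check in the spirit of
# the paper's metric. All names here are hypothetical; `llm_verdict` is a toy
# oracle standing in for an LLM fact-checker that returns True/False.

FACTS = {"p": True, "q": False}  # toy ground truth for atomic claims

def llm_verdict(claim):
    """Evaluate a claim given as a nested tuple in prefix notation:
    ("ATOM", name), ("NOT", c), ("AND", c1, c2), or ("OR", c1, c2)."""
    op, *args = claim
    if op == "ATOM":
        return FACTS[args[0]]
    if op == "NOT":
        return not llm_verdict(args[0])
    if op == "AND":
        return llm_verdict(args[0]) and llm_verdict(args[1])
    if op == "OR":
        return llm_verdict(args[0]) or llm_verdict(args[1])
    raise ValueError(f"unknown operator: {op}")

def negation_consistency(atoms):
    """Fraction of atoms p where verdict(NOT p) agrees with not verdict(p).
    A logically consistent model scores 1.0; real LLMs often score lower."""
    hits = sum(
        llm_verdict(("NOT", ("ATOM", a))) == (not llm_verdict(("ATOM", a)))
        for a in atoms
    )
    return hits / len(atoms)

print(negation_consistency(["p", "q"]))  # → 1.0 for this trivially consistent oracle
```

Analogous scores for conjunction and disjunction would compare the verdict on `p AND q` (or `p OR q`) against the Boolean combination of the atomic verdicts.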

📝 Abstract
In recent years, large language models (LLMs) have demonstrated significant success across varied natural language tasks such as translation, question answering, summarization, and fact-checking. Despite their impressive ability to generate human-like text, LLMs are infamous for inconsistent responses: a meaning-preserving change in the input query can yield a contradictory answer, which contributes to vulnerabilities of LLMs such as hallucination. Consequently, existing research focuses on simple paraphrasing-based consistency assessment of LLMs and ignores complex queries that demand a deeper grasp of logical reasoning. Our work therefore addresses the logical inconsistency of LLMs under complex queries built from primitive logical operators, e.g., negation, conjunction, and disjunction. As a test bed, we consider retrieval-augmented LLMs on a fact-checking task involving propositional logic queries over knowledge graphs (KGs). Our contributions are threefold. Benchmark: we introduce three logical fact-checking datasets over KGs to support community progress towards logically consistent LLMs. Assessment: we propose consistency measures for LLMs on propositional logic queries and demonstrate that existing LLMs lack logical consistency, especially on complex queries. Improvement: we employ supervised fine-tuning to improve the logical consistency of LLMs on the complex fact-checking task with KG contexts. Our source code and benchmarks are publicly available.
Problem

Research questions and friction points this paper is trying to address.

Addresses logical inconsistency in large language models (LLMs) under complex logical queries.
Focuses on fact-checking tasks involving propositional logic queries from knowledge graphs (KGs).
Proposes measures and fine-tuning to improve LLMs' logical consistency on complex queries.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised fine-tuning enhances LLM logical consistency.
New benchmarks assess LLMs on logical fact-checking tasks.
Retrieval-augmented LLMs tackle complex propositional logic queries.
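The retrieval-augmented setup can be sketched as follows. The KG contents, the entity-matching heuristic, and the prompt template are illustrative assumptions, not the paper's actual data model or retriever.

```python
# Hedged sketch of KG-augmented fact-checking: retrieve triples relevant to a
# claim and prepend them as context, RAG-style. The KG, the entity-overlap
# retriever, and the prompt format below are all hypothetical.

KG = {
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
}

def retrieve(claim_entities):
    """Return KG triples whose subject or object appears in the claim."""
    return [t for t in KG if t[0] in claim_entities or t[2] in claim_entities]

def build_prompt(claim, triples):
    """Serialize retrieved triples as context ahead of the claim."""
    context = "\n".join(f"{s} {r} {o}" for s, r, o in triples)
    return f"Context:\n{context}\n\nClaim: {claim}\nTrue or False?"

prompt = build_prompt(
    "Berlin is the capital of France",
    retrieve({"Berlin", "France"}),
)
print(prompt)
```

The resulting prompt would be passed to the (possibly fine-tuned) LLM, whose verdict can then be scored with the consistency measures across negated, conjoined, and disjoined variants of the claim.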
Bishwamittra Ghosh
Postdoctoral Researcher, MPI-SWS, Germany
Trustworthy Machine Learning, Formal Methods
Sarah Hasan
Aalborg University, Denmark
Naheed Anjum Arafat
Independent Researcher, USA
Arijit Khan
Aalborg University, Denmark