Existing LLMs Are Not Self-Consistent For Simple Tasks

📅 2025-06-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies pervasive self-inconsistency in large language models (LLMs), i.e., logically contradictory outputs across multiple inferences on identical inputs, when performing elementary symbolic reasoning tasks such as ordering points on a line or plane and inferring kinship relations, undermining decision reliability and interpretability. To quantify and mitigate the issue, the authors introduce inconsistency metrics and propose two automated correction methods, one graph-based and one energy-based. Experiments across open- and closed-weight models, including DeepSeek-R1 and GPT-o4-mini, reveal significant inconsistency even on these basic tasks. The proposed methods consistently improve consistency scores, yet complete elimination remains challenging. The released code and data serve as a benchmark for assessing LLM reliability in symbolic reasoning.

📝 Abstract
Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency -- no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods -- a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
Problem

Research questions and friction points this paper is trying to address.

LLMs lack self-consistency in simple reasoning tasks
Existing models show contradictions in basic spatial and familial logic
New metrics and methods aim to quantify and reduce inconsistencies
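The page does not reproduce the paper's actual metric, but the kind of inconsistency it targets is easy to illustrate. A rough sketch (the function name and scoring scheme below are illustrative, not the authors'): collect a model's pairwise point-ordering answers over repeated queries, then count pairs asserted in both directions and triples whose answers form a directed cycle.

```python
from itertools import combinations

def inconsistency_score(answers):
    """Score contradictions in a model's point-ordering answers.

    answers: list of (a, b, rel) triples collected over repeated queries,
    where rel == '<' means the model claimed point a precedes point b.
    Returns the fraction of checked relations that are contradictory:
    pairs asserted in both directions, plus triples forming a cycle.
    """
    precedes = set()
    for a, b, rel in answers:
        precedes.add((a, b) if rel == '<' else (b, a))
    points = sorted({p for a, b, _ in answers for p in (a, b)})

    contradictions, checks = 0, 0
    # Direct contradictions: both "a before b" and "b before a" asserted.
    for a, b in combinations(points, 2):
        checks += 1
        if (a, b) in precedes and (b, a) in precedes:
            contradictions += 1
    # Transitivity violations: the triple's answers form a directed cycle.
    for a, b, c in combinations(points, 3):
        checks += 1
        if ({(a, b), (b, c), (c, a)} <= precedes
                or {(a, c), (c, b), (b, a)} <= precedes):
            contradictions += 1
    return contradictions / checks if checks else 0.0

# Three answers that cannot all be true at once: A<B, B<C, but C<A.
print(inconsistency_score([('A', 'B', '<'), ('B', 'C', '<'), ('A', 'C', '>')]))
# → 0.25 (one cyclic triple among three pair checks and one triple check)
```

A score of 0.0 means the sampled answers admit at least one consistent total order; any cycle makes the score positive.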
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduce inconsistency metrics for LLMs
Propose graph-based automated correction method
Develop energy-based approach for improvements
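The energy-based method itself is not spelled out on this page; a minimal sketch of the general idea, under the assumption that each point gets a scalar position and contradictions are resolved by minimizing a hinge energy over the model's (possibly conflicting) ordering claims. The function name, loss, and hyperparameters are illustrative, not the authors':

```python
def repair_ordering(assertions, n_steps=500, lr=0.1, margin=1.0):
    """Derive one consistent ordering from contradictory claims.

    assertions: list of (a, b) pairs, each meaning the model claimed
    that a precedes b. Assign every point a scalar position x and
    minimize the hinge energy E = sum over (a, b) of
    max(0, margin - (x[b] - x[a])) by gradient descent; sorting the
    final positions yields a single contradiction-free ordering.
    """
    points = sorted({p for pair in assertions for p in pair})
    x = {p: 0.0 for p in points}
    for _ in range(n_steps):
        grad = {p: 0.0 for p in points}
        for a, b in assertions:
            if margin - (x[b] - x[a]) > 0:  # constraint violated
                grad[a] += 1.0              # dE/dx[a] = +1
                grad[b] -= 1.0              # dE/dx[b] = -1
        for p in points:
            x[p] -= lr * grad[p]
    return sorted(points, key=lambda p: x[p])

# Contradictory claims: A<B twice, B<C twice, plus one stray C<A.
claims = [('A', 'B'), ('A', 'B'), ('B', 'C'), ('B', 'C'), ('C', 'A')]
print(repair_ordering(claims))  # → ['A', 'B', 'C'] (majority wins)
```

The outnumbered claim C<A stays violated at the minimum, so the repaired ordering follows the majority of the model's assertions, which is the intuitive behavior one would want from an energy-style repair.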