🤖 AI Summary
This work shows that large language models (LLMs) are often self-inconsistent, producing logically contradictory outputs on elementary symbolic reasoning tasks such as ordering points on a line or plane and inferring kinship relations in a family tree, which undermines the reliability and interpretability of their decisions. To quantify and mitigate the problem, we introduce inconsistency metrics and two automated methods for detecting and correcting inconsistencies, one graph-based and one energy-based. Experiments across open- and closed-weight models show that smaller models are highly inconsistent and that even state-of-the-art models such as DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. The proposed methods yield partial improvements, but eliminating inconsistency entirely remains difficult. The accompanying code and data provide a basis for assessing LLM reliability on symbolic reasoning.
📝 Abstract
Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency -- no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods -- a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
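To make the point-ordering task concrete, the following is a minimal sketch of how a contradiction among pairwise answers could be detected with a directed graph. It is an illustration only and is not taken from the paper or its repository: the function name `find_contradiction` and the input format (pairs `(a, b)` meaning the model answered "a is left of b") are assumptions. A set of strict ordering answers is consistent exactly when the resulting graph has no directed cycle.

```python
from collections import defaultdict


def find_contradiction(comparisons):
    """Return a list of nodes forming a directed cycle, or None if none exists.

    comparisons: iterable of (a, b) pairs, each meaning the model answered
    "a < b". This input format is a hypothetical choice for illustration.
    """
    graph = defaultdict(list)
    nodes = set()
    for a, b in comparisons:
        graph[a].append(b)
        nodes.update((a, b))

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current DFS path, finished
    color = {n: WHITE for n in nodes}
    path = []

    def dfs(u):
        color[u] = GRAY
        path.append(u)
        for v in graph.get(u, []):
            if color[v] == GRAY:
                # v is already on the current path, so path[v..u] + v is a cycle.
                return path[path.index(v):] + [v]
            if color[v] == WHITE:
                cycle = dfs(v)
                if cycle:
                    return cycle
        path.pop()
        color[u] = BLACK
        return None

    # Sorted start order keeps the reported cycle deterministic.
    for n in sorted(nodes):
        if color[n] == WHITE:
            cycle = dfs(n)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    answers = [("A", "B"), ("B", "C"), ("C", "A")]
    print(find_contradiction(answers))
```

In this sketch, the answers "A < B", "B < C", and "C < A" cannot all hold, and the function reports the cycle that witnesses the contradiction. A transitive set such as "A < B", "B < C", "A < C" returns `None`.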