🤖 AI Summary
This work addresses the unreliability of open-weight large language models (LLMs) on complex, multi-step spatial reasoning. We propose a verifiable reasoning framework that integrates conformal language modeling (CLM) with answer set programming (ASP): CLM calibrates LLM-generated structured intermediate steps, ASP verifies their logical consistency, and an LLM-as-Judge mechanism enables dynamic confidence assessment and result filtering. Our core contribution is the first synergistic embedding of conformal prediction's statistical guarantees and ASP's formal reasoning capabilities into the LLM reasoning chain, yielding provable confidence bounds for each inference step. Experiments on the StepGame dataset demonstrate that our approach significantly outperforms standard sampling baselines, with gains over the baselines that hold as reasoning depth increases. Moreover, the LLM-as-Judge component improves precision in identifying logically correct outputs.
📝 Abstract
In this paper, we examine the use of Conformal Language Modelling (CLM) alongside Answer Set Programming (ASP) to enhance the performance of standard open-weight LLMs on complex multi-step reasoning tasks. Using the StepGame dataset, which requires spatial reasoning, we apply CLM to generate sets of ASP programs from an LLM, providing statistical guarantees on the correctness of the outputs. Experimental results show that CLM significantly outperforms baseline models that use standard sampling methods, achieving substantial accuracy improvements across different levels of reasoning complexity. Additionally, the LLM-as-Judge metric enhances CLM's performance, especially in assessing whether generated ASP outputs are structurally and logically correct. However, calibrating CLM with diverse calibration sets did not improve generalizability to tasks requiring much longer reasoning chains, indicating limitations in handling more complex tasks.
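The conformal filtering idea underlying CLM can be illustrated with a minimal split-conformal sketch: scores from a held-out calibration set fix a threshold, and only candidate ASP programs whose score passes that threshold are admitted. This is an illustrative simplification under assumed names (`calibrate_threshold`, `conformal_filter`, and the scoring function are hypothetical), not the paper's actual pipeline.

```python
import math

def calibrate_threshold(cal_scores, alpha=0.1):
    """Pick a score cutoff from calibration nonconformity scores so that,
    under exchangeability, admitted sets cover a correct output with
    probability >= 1 - alpha (standard split-conformal quantile)."""
    n = len(cal_scores)
    ordered = sorted(cal_scores)
    # Conformal quantile index: ceil((n + 1) * (1 - alpha)) - 1, clipped to range.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return ordered[k]

def conformal_filter(candidates, score_fn, threshold):
    """Keep only candidates (e.g. LLM-generated ASP programs) whose
    nonconformity score does not exceed the calibrated threshold."""
    return [c for c in candidates if score_fn(c) <= threshold]

# Example: calibration scores from 4 held-out examples, alpha = 0.5.
tau = calibrate_threshold([0.2, 0.4, 0.6, 0.8], alpha=0.5)  # -> 0.6
kept = conformal_filter([0.1, 0.5, 0.9], lambda s: s, tau)   # -> [0.1, 0.5]
```

In the full framework, the score would come from the LLM-as-Judge and ASP consistency checks rather than the raw values used here.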