🤖 AI Summary
To address the lack of causal reasoning and verifiability in large language models (LLMs) for high-stakes question answering in closed domains (e.g., education), this paper proposes MCFR, a novel framework that, for the first time, tightly integrates LLMs with formal model checking. MCFR employs a neuro-symbolic architecture that automatically translates natural-language questions into formal specifications expressed as state-transition systems, then performs property verification over the constructed transition system. It supports verifiable multi-step dynamic reasoning, including conditional transitions and procedural progress. Evaluated on EduMC-QA, a custom educational benchmark, MCFR significantly outperforms baseline LLMs, including ChatGPT, DeepSeek, and Claude, in reasoning accuracy and logical consistency, while providing factual correctness, interpretability, and regulatory compliance through formally grounded inference.
📝 Abstract
Reasoning is essential for closed-domain QA systems in which procedural correctness and policy compliance are critical. While large language models (LLMs) have shown strong performance on many reasoning tasks, recent work reveals that their reasoning traces are often unfaithful, serving more as plausible justifications than as causally grounded derivations. Efforts to combine LLMs with symbolic engines (e.g., Prover9, Z3) have improved reliability but remain limited to static forms of logic, struggling with dynamic, state-based reasoning such as multi-step progressions and conditional transitions.
In this paper, we propose MCFR (Model Checking for Formal Reasoning), a neuro-symbolic framework that integrates LLMs with model checking to support property verification. MCFR translates natural language into formal specifications and verifies them over transition models. To support evaluation, we introduce EduMC-QA, a benchmark dataset grounded in real academic procedures. Our results show that MCFR improves reasoning faithfulness and interpretability, offering a viable path toward verifiable QA in high-stakes closed-domain applications. In addition to evaluating MCFR, we compare its performance with state-of-the-art LLMs such as ChatGPT, DeepSeek, and Claude to contextualize its effectiveness.
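To make the verification step concrete, here is a minimal sketch of the kind of property check performed over a transition model: a toy state-transition system for a hypothetical academic procedure, with a reachability query corresponding to the CTL property "EF goal" (some execution path eventually reaches the goal state). The state names and transitions are illustrative assumptions, not MCFR's actual model or API.

```python
from collections import deque

# Hypothetical transition system for an academic procedure:
# each state maps to the set of states reachable in one step.
TRANSITIONS = {
    "applied":    {"registered", "rejected"},
    "registered": {"enrolled"},
    "enrolled":   {"completed", "withdrawn"},
    "rejected":   set(),
    "completed":  set(),
    "withdrawn":  set(),
}

def holds_ef(start: str, goal: str) -> bool:
    """Check the CTL property EF goal from `start`:
    does some path through the transition system reach `goal`?
    Implemented as a breadth-first search over the state graph."""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        for nxt in TRANSITIONS.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# "Can an applicant eventually complete the course?" -> True
print(holds_ef("applied", "completed"))
# "Can a rejected applicant still enroll?" -> False (no outgoing transitions)
print(holds_ef("rejected", "enrolled"))
```

A production model checker (e.g., one handling full temporal-logic specifications) replaces this explicit BFS with symbolic or automata-based algorithms, but the shape of the question is the same: a property is verified against a transition model rather than answered by free-form generation.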