🤖 AI Summary
This work addresses the challenge in natural language inference where long texts and multi-step reasoning often lead to errors in automatic formalization, and existing approaches struggle to precisely localize such errors, frequently resorting to inefficient global regeneration. The authors propose a decomposition-formalization framework that structures premise-hypothesis pairs into trees of atomic inference steps. By employing bottom-up validation and a θ-substitution mechanism, the method enables fine-grained error localization and consistent event-role binding, allowing for targeted corrections based on local diagnostics. Integrating large language models with a theorem prover, the approach significantly improves explanation verification rates by 21.6%–48.9% across five state-of-the-art models, while reducing both the number of iterations and runtime, and maintaining high NLI accuracy.
📝 Abstract
Recent work has shown that integrating large language models (LLMs) with theorem provers (TPs) in neuro-symbolic pipelines helps with entailment verification and proof-guided refinement of explanations for natural language inference (NLI). However, scaling such refinement to naturalistic NLI remains difficult: long, syntactically rich inputs and deep multi-step arguments amplify autoformalisation errors, where a single local mismatch can invalidate the proof. Moreover, current methods often handle failures via costly global regeneration due to the difficulty of localising the responsible span or step from prover diagnostics. Aiming to address these problems, we propose a decompose-and-formalise framework that (i) decomposes premise-hypothesis pairs into an entailment tree of atomic steps, (ii) verifies the tree bottom-up to isolate failures to specific nodes, and (iii) performs local diagnostic-guided refinement instead of regenerating the whole explanation. Moreover, to improve faithfulness of autoformalisation, we introduce $\theta$-substitution in an event-based logical form to enforce consistent argument-role bindings. Across a range of reasoning tasks using five LLM backbones, our method achieves the highest explanation verification rates, improving over the state-of-the-art by 26.2%, 21.7%, 21.6% and 48.9%, while reducing refinement iterations and runtime and preserving strong NLI accuracy.
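The two mechanisms the abstract names, bottom-up verification over an entailment tree and θ-substitution for consistent bindings, can be illustrated with a minimal sketch. Everything below is a hypothetical reconstruction for intuition only: the `StepNode` structure, the `prove` callback, and the toy prover are assumptions, not the authors' actual implementation or logical form.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class StepNode:
    """One atomic inference step in the entailment tree (illustrative)."""
    premises: list[str]                 # formalised premises of this step
    conclusion: str                     # formalised conclusion of this step
    children: list["StepNode"] = field(default_factory=list)

def apply_theta(form: str, theta: dict[str, str]) -> str:
    """Naive sketch of θ-substitution: apply one binding (variable -> term)
    uniformly, so every occurrence of a variable gets the same term."""
    for var, term in theta.items():
        form = form.replace(var, term)
    return form

def verify_bottom_up(node: StepNode,
                     prove: Callable[[list[str], str], bool]) -> Optional[StepNode]:
    """Return the deepest failing node, or None if the whole tree verifies.
    Checking leaves before parents isolates a failure to one atomic step,
    so refinement can be local instead of regenerating everything."""
    for child in node.children:
        failing = verify_bottom_up(child, prove)
        if failing is not None:
            return failing              # stop at the deepest failure
    # All sub-steps verified; check this step's own local entailment.
    if not prove(node.premises, node.conclusion):
        return node
    return None

# Toy "prover": a conclusion is entailed iff it literally appears among
# the premises -- a stand-in for a real theorem-prover call.
toy_prove = lambda prems, concl: concl in prems

theta = {"X": "e1"}
leaf = StepNode(premises=[apply_theta("rains(X)", theta)],
                conclusion="rains(e1)")
bad = StepNode(premises=["rains(e1)"], conclusion="wet(ground)")  # fails toy check
root = StepNode(premises=["rains(e1)", "wet(ground)"],
                conclusion="wet(ground)", children=[leaf, bad])

failing = verify_bottom_up(root, toy_prove)
print(failing is bad)  # only the mismatched step is flagged, not the root
```

The point of the sketch is the control flow: a single local mismatch (`bad`) is returned as the failure site, so the diagnostic names one node rather than invalidating the whole proof.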