🤖 AI Summary
This work addresses the formatting and semantic errors that small language models produce when translating natural language into first-order logic (FOL), errors that undermine the reliability of symbolic reasoning. To mitigate them, the authors propose a staged incremental reasoning framework: first, a large language model synthesizes training data for supervised fine-tuning of the small model; then, inference is decoupled into two stages, predicate generation and FOL formulation. An external verification module detects and corrects predicate-arity errors, further improving translation accuracy. Evaluated on four logical reasoning benchmarks, the approach significantly reduces error rates, improves predicate coverage, and boosts overall reasoning performance, bringing small models closer to reliable, verifiable symbolic reasoning.
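To make the staged flow concrete, here is a minimal sketch of two-stage (incremental) inference in Python. The prompt wording, the `generate` callable, and the function names are illustrative assumptions, not the paper's actual interface.

```python
from typing import Callable

def translate_incrementally(
    premise: str,
    generate: Callable[[str], str],  # any LM completion function (assumption)
) -> tuple[list[str], str]:
    """Two-stage inference: fix the predicates first, then translate to FOL."""
    # Stage 1: elicit only predicate signatures, e.g. "Owns(person, item)".
    raw = generate(f"List the FOL predicates needed for: {premise}")
    predicates = [line.strip() for line in raw.splitlines() if line.strip()]
    # Stage 2: translate to FOL, constrained to the stage-1 predicates.
    fol = generate(
        "Using only these predicates: " + "; ".join(predicates) + "\n"
        f"Translate into first-order logic: {premise}"
    )
    return predicates, fol
```

Decoupling the stages gives an external check a stable target: the stage-1 predicate list can be validated before any FOL is generated.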
📝 Abstract
The use of formal language for deductive logical reasoning aligns well with language models (LMs): translating natural language (NL) into first-order logic (FOL) and employing an external solver yields a verifiable, and therefore reliable, reasoning system. However, smaller LMs often struggle with this translation task, frequently producing incorrect symbolic outputs due to formatting and translation errors. Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. To address this, we first categorize common errors and fine-tune smaller LMs on data synthesized by large language models, evaluating the results against the defined error categories. We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior and improving generation quality as measured by predicate metrics. This decomposition also enables a verification module that targets predicate-arity errors to further improve performance. Our study evaluates three model families across four logical-reasoning datasets. Together, fine-tuning, incremental inference, and verification reduce error rates, increase predicate coverage, and improve reasoning performance for smaller LMs, moving us closer to reliable and accessible symbolic-reasoning systems.
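As one illustration of what the verification module could check, predicate-arity consistency can be verified without the LM: parse each predicate application in the generated formulas and flag any predicate used with more than one argument count. The sketch below is a hypothetical stand-in, assuming predicates are applied with flat, comma-separated arguments rather than nested terms.

```python
import re
from collections import defaultdict

# Matches applications like Pred(x, y) with flat (non-nested) arguments.
PRED_APP = re.compile(r"\b([A-Z]\w*)\(([^()]*)\)")

def arity_errors(formulas: list[str]) -> dict[str, set[int]]:
    """Return predicates that occur with more than one arity."""
    arities: dict[str, set[int]] = defaultdict(set)
    for formula in formulas:
        for name, args in PRED_APP.findall(formula):
            arity = len([a for a in args.split(",") if a.strip()])
            arities[name].add(arity)
    return {p: s for p, s in arities.items() if len(s) > 1}

# "Owns" appears with 2 and then 1 argument, so it is flagged.
print(arity_errors(["∀x (Person(x) → Owns(x, car))", "Owns(john)"]))
# {'Owns': {1, 2}}
```

A corrector could then route the flagged predicates back to the model, or normalize them against the signatures declared in the predicate-generation stage.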