Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing first-order logic (FOL) reasoning benchmarks suffer from high manual annotation costs and rigid templating, which limit their complexity, scalability, and diversity. To address these limitations, the authors propose ProverGen, a framework that synergistically integrates large language models (LLMs) with symbolic theorem provers to automatically generate high-quality, scalable, and diverse FOL reasoning data. The resulting dataset, ProverQA, annotates every sample with logically verifiable intermediate inference steps. ProverQA proves challenging: state-of-the-art LLMs perform poorly on it even with chain-of-thought (CoT) prompting. Meanwhile, Llama3.1-8B-Instruct finetuned on a ProverGen-generated training set achieves consistent gains on both in-distribution and out-of-distribution test sets, demonstrating the framework's effectiveness and generalization for FOL reasoning.

📝 Abstract
First-order logic (FOL) reasoning, which involves sequential deduction, is pivotal for intelligent systems and serves as a valuable task for evaluating reasoning capabilities, particularly in chain-of-thought (CoT) contexts. Existing benchmarks often rely on extensive human annotation or handcrafted templates, making it difficult to achieve the necessary complexity, scalability, and diversity for robust evaluation. To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems, even with CoT prompting, highlighting the dataset's challenging nature. We also finetune Llama3.1-8B-Instruct on a separate training set generated by our framework. The finetuned model demonstrates consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of our proposed data generation framework. Code available at: https://github.com/opendatalab/ProverGen
Problem

Research questions and friction points this paper is trying to address.

Evaluate FOL reasoning in intelligent systems.
Overcome limitations of human-annotated benchmarks.
Generate scalable, diverse FOL reasoning datasets.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs combined with symbolic provers
Generates scalable FOL reasoning dataset
Includes coherent intermediate reasoning steps
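The core idea above can be sketched in miniature: an LLM proposes a reasoning chain, and a symbolic component checks that each intermediate step is actually entailed before the sample is accepted. The toy verifier below uses forward chaining over Horn-style rules as a simplified stand-in for a full FOL theorem prover; all function names and the rule encoding are illustrative assumptions, not taken from the paper's codebase.

```python
# Toy stand-in for the prover-side verification in an LLM + symbolic-prover
# pipeline. Facts and conclusions are plain strings; a rule is a pair
# (premises, conclusion) meaning: if all premises are derived, derive the
# conclusion. This is far simpler than real FOL proving, but shows the
# accept/reject logic for LLM-generated intermediate steps.

def forward_chain(facts, rules):
    """Return the closure of `facts` under the given Horn rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and all(p in derived for p in premises):
                derived.add(conclusion)
                changed = True
    return derived

def verify_steps(facts, rules, steps):
    """Check that every claimed intermediate step is derivable from what is
    known so far; reject the whole sample on the first unsupported step."""
    known = set(facts)
    for claim in steps:
        if claim not in forward_chain(known, rules):
            return False  # step not entailed: discard this generated sample
        known.add(claim)
    return True

# Hypothetical generated sample: premises, rules, and an LLM-proposed chain.
facts = {"Rex is a dog"}
rules = [
    ({"Rex is a dog"}, "Rex is a mammal"),
    ({"Rex is a mammal"}, "Rex is warm-blooded"),
]
print(verify_steps(facts, rules, ["Rex is a mammal", "Rex is warm-blooded"]))
print(verify_steps(facts, rules, ["Rex is a reptile"]))
```

In the full framework this checking role is played by an actual symbolic prover, which lets arbitrarily long generated chains be filtered for logical soundness without human annotation.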