SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs

📅 2025-08-31

🤖 AI Summary

Problem: Existing logical reasoning benchmarks lack controllable, scalable tools for fine-grained analysis, which hinders systematic evaluation of LLMs across dimensions such as instance scale, problem type, and question format. Method: We propose SATQuest, a framework that automatically generates diverse logical reasoning tasks from conjunctive normal form (CNF) formulas; verifies answers objectively and automatically with PySAT; randomizes problem generation along multiple controllable dimensions, which mitigates memorization; and uses the verifier's output as the reward signal for reinforcement fine-tuning. Contribution/Results: Experiments reveal significant limitations in the logical reasoning of mainstream LLMs, particularly on formats other than familiar mathematical notation. Reinforcement fine-tuning with SATQuest rewards substantially improves performance on the targeted tasks and generalizes to more complex instances, although cross-format adaptation remains challenging. SATQuest offers a foundation for diagnostic evaluation and targeted improvement of logical reasoning in LLMs.

📝 Abstract

Recent advances in Large Language Models (LLMs) have demonstrated remarkable general reasoning capabilities. However, systematically evaluating and enhancing these reasoning capabilities is challenging due to the lack of controllable and scalable tools for fine-grained analysis. Existing benchmarks and datasets often lack the necessary variable control for multi-dimensional, systematic analysis and training, or have narrow problem types and formats. To address these limitations, we introduce SATQuest, a systematic verifier designed to evaluate and enhance logical reasoning in LLMs by generating diverse, Satisfiability-based logical reasoning problems directly from Conjunctive Normal Form (CNF) instances. SATQuest structures these problems along three orthogonal dimensions: instance scale, problem type, and question format, employing randomized, SAT-based problem generation and objective answer verification via PySAT. This design mitigates memorization issues, allows for nuanced insights into reasoning performance, and enables effective reinforcement fine-tuning. Our extensive evaluation of various LLMs using SATQuest identified significant limitations in their logical reasoning, particularly in generalizing beyond familiar mathematical formats. Furthermore, we show that reinforcement fine-tuning with SATQuest rewards substantially improves targeted task performance and generalizes to more complex instances, while highlighting remaining challenges in cross-format adaptation. Through these demonstrations, we showcase SATQuest's potential as a foundational tool and a valuable starting point for advancing LLM logical reasoning.
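The generate-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not SATQuest's implementation: the function names (`random_cnf`, `satisfies`, `brute_force_models`) are hypothetical, and a brute-force enumeration stands in for the PySAT solver so that the example needs only the Python standard library. Clauses use DIMACS-style integer literals, where `3` means x3 and `-3` means NOT x3.

```python
import itertools
import random

Clause = list[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def random_cnf(num_vars: int, num_clauses: int, k: int, seed: int) -> list[Clause]:
    """Sample a random k-CNF formula (hypothetical generator, not SATQuest's)."""
    rng = random.Random(seed)
    formula = []
    for _ in range(num_clauses):
        # Pick k distinct variables, then negate each with probability 0.5.
        variables = rng.sample(range(1, num_vars + 1), k)
        formula.append([v if rng.random() < 0.5 else -v for v in variables])
    return formula


def satisfies(formula: list[Clause], assignment: dict[int, bool]) -> bool:
    """True if every clause has at least one literal made true by the assignment."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )


def brute_force_models(formula: list[Clause], num_vars: int):
    """Yield every satisfying assignment; a stand-in for a real SAT solver,
    feasible only for small instances."""
    for bits in itertools.product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if satisfies(formula, assignment):
            yield assignment
```

Because the check in `satisfies` depends only on the formula and the proposed assignment, a freshly sampled instance can be graded without any stored answer key, which is the property that makes randomized generation useful against memorization.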

Problem

Research questions and friction points this paper is trying to address.

Evaluating the logical reasoning capabilities of large language models systematically

Generating diverse SAT-based problems with controlled variables for fine-grained analysis

Enabling reinforcement fine-tuning that enhances reasoning performance
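To make the idea of varying the question format concrete, the sketch below renders one CNF instance in two ways: as a formula in mathematical notation and as DIMACS CNF text. The helper names `to_math` and `to_dimacs` are hypothetical, and these two renderings are chosen for illustration only; they are not taken from the paper's list of formats.

```python
def to_math(formula: list[list[int]]) -> str:
    """Render a CNF formula in mathematical notation, e.g. (x1 ∨ ¬x2) ∧ (x2 ∨ x3)."""
    def literal(lit: int) -> str:
        return f"x{lit}" if lit > 0 else f"¬x{-lit}"

    clauses = ("(" + " ∨ ".join(literal(l) for l in clause) + ")" for clause in formula)
    return " ∧ ".join(clauses)


def to_dimacs(formula: list[list[int]], num_vars: int) -> str:
    """Render the same formula as DIMACS CNF: a header line, then one
    zero-terminated clause per line."""
    lines = [f"p cnf {num_vars} {len(formula)}"]
    lines += [" ".join(str(l) for l in clause) + " 0" for clause in formula]
    return "\n".join(lines)


if __name__ == "__main__":
    example = [[1, -2], [2, 3]]
    print(to_math(example))
    print(to_dimacs(example, num_vars=3))
```

Both outputs encode the same logical content, so any accuracy gap between them isolates the effect of presentation rather than problem difficulty.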

Innovation

Methods, ideas, or system contributions that make the work stand out.

SAT-based problem generation from CNF instances

Problems structured along three orthogonal dimensions: instance scale, problem type, and question format

Objective answer verification via PySAT, which also supplies rewards for reinforcement fine-tuning
