SLR: An Automated Synthesis Framework for Scalable Logical Reasoning

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a prevalent failure mode of large language models (LLMs): producing syntactically valid yet logically incorrect rules on logical reasoning tasks. To tackle this, we propose SLR, an end-to-end automated framework for synthesizing and verifying logical reasoning tasks. Methodologically, we introduce a fully automated, annotation-free, and formally verifiable task-synthesis paradigm in which each task comprises a *latent ground-truth rule*, an *executable verification program*, and an *instruction prompt*, and we build SLR-Bench, a 20-level benchmark of progressively harder tasks. Our approach integrates formal rule modeling, symbolic verification generation, curriculum-driven synthesis, and logic-aware fine-tuning (*logic-tuning*). Comprehensive evaluation across 19k+ tasks systematically exposes pervasive logical deficiencies in contemporary LLMs. Logic-tuning doubles Llama-3-8B's accuracy, matching Gemini-Flash-Thinking's performance while reducing computational overhead by an order of magnitude.

📝 Abstract
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR enables scalable, automated synthesis of inductive reasoning tasks with precisely controlled difficulty. For each task, SLR synthesizes (i) a latent ground-truth rule, (ii) an executable validation program used by a symbolic judge to deterministically verify model outputs, and (iii) an instruction prompt for the reasoning task. Using SLR, we create SLR-Bench, a benchmark comprising over 19k prompts spanning 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs do somewhat better, but incur substantial increases in test-time compute, sometimes exceeding 15k completion tokens. Finally, logic-tuning via SLR doubles Llama-3-8B's accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of the computational cost. SLR is fully automated, requires no human annotation, ensures dataset novelty, and offers a scalable environment for probing and advancing LLMs' reasoning capabilities.
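The symbolic-judge idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation (SLR's tasks are logic programs); here a candidate rule is stood in for by a Python predicate, and all names (`symbolic_judge`, the toy train task) are hypothetical. The point is that verification is deterministic: the judge simply checks the rule against labeled examples from the validation program.

```python
# Hypothetical sketch of a symbolic judge. A candidate rule (here a
# Python predicate standing in for a logic-program rule) is checked
# deterministically against labeled examples: it must accept every
# positive example and reject every negative one.

def symbolic_judge(candidate_rule, positives, negatives):
    """Return True iff the rule accepts all positives and no negatives."""
    return (all(candidate_rule(ex) for ex in positives)
            and not any(candidate_rule(ex) for ex in negatives))

# Toy inductive task in the spirit of classic train-style problems:
# the latent ground-truth rule is "the train has a short, closed car".
positives = [{"cars": [("short", "closed"), ("long", "open")]}]
negatives = [{"cars": [("long", "open")]}]

correct_rule = lambda t: ("short", "closed") in t["cars"]
# Syntactically valid but logically wrong: matched by a negative example.
wrong_rule = lambda t: ("long", "open") in t["cars"]

print(symbolic_judge(correct_rule, positives, negatives))  # True
print(symbolic_judge(wrong_rule, positives, negatives))    # False
```

Because the check is purely symbolic, grading needs no human annotation and no LLM-as-judge, which is what makes the benchmark's verification scalable.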
Problem

Research questions and friction points this paper is trying to address.

Automated synthesis of scalable logical reasoning tasks
Evaluation of LLMs' accuracy in logical inference
Logic-tuning to improve LLMs' reasoning efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated synthesis of inductive reasoning tasks
Symbolic judge for deterministic output verification
Logic-tuning doubles accuracy with low compute