Training Language Models to Use Prolog as a Tool

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently hallucinate and produce unverifiable outputs during reasoning. Method: This paper uses Prolog as a verifiable external tool to build a logically rigorous, auditable reasoning framework. It applies Group Relative Policy Optimization (GRPO) to jointly optimize LLM–Prolog interaction (integrating prompt design, reward shaping, and inference protocols) and fine-tunes Qwen2.5-3B-Instruct on the GSM8K-Prolog-Prover dataset. Contributions/Results: The framework supports four inference modes: single-shot generation, best-of-N, and two agentic variants in which Prolog is invoked internally or independently. The fine-tuned 3B model's zero-shot MMLU performance matches a 7B model's few-shot results. Best-of-N sampling combined with external Prolog verification yields the highest GSM8K accuracy among the tested protocols. An integrated self-repair mechanism substantially improves zero-shot generalization on the MMLU-STEM and MMLU-Pro benchmarks.
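The best-of-N protocol described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_candidates` is a hypothetical stand-in for sampling Prolog programs from the LLM, and `prolog_executes` crudely stubs what a real Prolog engine (e.g. SWI-Prolog) would check by actually running the program.

```python
def generate_candidates(prompt, n):
    # Hypothetical stand-in for sampling n Prolog programs from the LLM.
    # Each candidate pairs a program string with its claimed numeric answer.
    return [{"program": f"answer(X) :- X is 2 + {i}.", "answer": 2 + i}
            for i in range(n)]

def prolog_executes(program):
    # Stand-in for running the candidate under a real Prolog engine and
    # checking that the query succeeds. Here we crudely accept any clause
    # that ends in '.' and contains an 'is' arithmetic goal.
    return program.strip().endswith(".") and " is " in program

def best_of_n(prompt, n):
    # Sample n candidates and return the first whose Prolog program
    # passes external verification; None if every sample fails.
    for cand in generate_candidates(prompt, n):
        if prolog_executes(cand["program"]):
            return cand
    return None

picked = best_of_n("Tom has 2 apples and buys 3 more. How many now?", n=8)
```

The key design point is that selection relies only on whether the program executes, not on ground-truth answers, so the same filter applies at test time.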

📝 Abstract
Ensuring reliable tool use is critical for safe agentic AI systems. Language models frequently produce unreliable reasoning with plausible but incorrect solutions that are difficult to verify. To address this, we investigate fine-tuning models to use Prolog as an external tool for verifiable computation. Using Group Relative Policy Optimization (GRPO), we fine-tune Qwen2.5-3B-Instruct on a cleaned GSM8K-Prolog-Prover dataset while varying (i) prompt structure, (ii) reward composition (execution, syntax, semantics, structure), and (iii) inference protocol: single-shot, best-of-N, and two agentic modes where Prolog is invoked internally or independently. Our reinforcement learning approach outperforms supervised fine-tuning, with our 3B model achieving zero-shot MMLU performance comparable to 7B few-shot results. Our findings reveal that: 1) joint tuning of prompt, reward, and inference shapes program syntax and logic; 2) best-of-N with external Prolog verification maximizes accuracy on GSM8K; 3) agentic inference with internal repair yields superior zero-shot generalization on MMLU-Stem and MMLU-Pro. These results demonstrate that grounding model reasoning in formal verification systems substantially improves reliability and auditability for safety-critical applications. The source code for reproducing our experiments is available at https://github.com/niklasmellgren/grpo-prolog-inference
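The abstract's reward composition (execution, syntax, semantics, structure) and GRPO's group-relative baseline might be combined roughly as below. The component weights are illustrative assumptions, not the paper's ablated values, and the functions are hypothetical sketches.

```python
from statistics import mean, pstdev

def shaped_reward(signals, weights=None):
    # Weighted sum over the four reward components the abstract names.
    # These weights are illustrative assumptions, not the paper's values.
    weights = weights or {"execution": 0.4, "syntax": 0.2,
                          "semantics": 0.3, "structure": 0.1}
    return sum(weights[k] * float(signals.get(k, 0.0)) for k in weights)

def group_advantages(rewards):
    # GRPO's group-relative baseline: each sampled program's reward is
    # normalized by the mean and std of its own sampling group,
    # replacing a learned value function.
    mu = mean(rewards)
    sd = pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]

# A program that runs and parses but reaches the wrong answer still earns
# partial credit, which keeps the training signal dense.
r_partial = shaped_reward({"execution": 1, "syntax": 1,
                           "semantics": 0, "structure": 1})
```

Dense partial credit of this kind is one plausible reason reward shaping matters in the paper's ablations: a purely binary correctness reward would leave most early samples with zero gradient signal.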
Problem

Research questions and friction points this paper is trying to address.

Address unreliable reasoning in language models through verifiable computation
Fine-tune models to use Prolog as an external tool for reliable solutions
Improve reliability and auditability in safety-critical AI applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning models to use Prolog for verifiable computation
Applying GRPO with varied prompts, rewards, and inference protocols
Grounding reasoning in formal verification to improve reliability
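The agentic mode with internal repair might loop as sketched below: execute the generated Prolog, and on failure feed the error back for a bounded number of repair attempts. All functions here are hypothetical stand-ins; a real system would call a Prolog engine and re-prompt the model.

```python
def run_prolog(program):
    # Stand-in for executing the program with a real Prolog engine and
    # returning (ok, result_or_error). This toy check fails only on
    # unbalanced parentheses.
    if program.count("(") != program.count(")"):
        return False, "syntax error: unbalanced parentheses"
    return True, "X = 5"

def repair(program, error):
    # Stand-in for re-prompting the model with the error message.
    # This toy repair just restores the missing closing parentheses.
    return program + ")" * (program.count("(") - program.count(")"))

def solve_with_repair(program, max_attempts=3):
    # Bounded self-repair loop: execute, and on failure ask for a fix.
    for _ in range(max_attempts):
        ok, out = run_prolog(program)
        if ok:
            return out
        program = repair(program, out)
    return None

result = solve_with_repair("answer(X) :- X is 2 + 3.")
```

Bounding the loop matters: without `max_attempts`, a candidate the model cannot fix would stall inference instead of falling through to a failure result.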