🤖 AI Summary
Translating natural language to first-order logic (FOL) remains a fundamental challenge in knowledge representation and formal reasoning. This work systematically evaluates large language models (LLMs) on the task, introducing two strategies: predicate conditioning, which supplies the model with an explicit predicate inventory, and multilingual joint training. It further proposes a dual-axis evaluation framework grounded in logical equivalence and predicate alignment. Experiments show that a fine-tuned Flan-T5-XXL substantially outperforms both GPT-4o and the symbolic system ccg2lambda, reaching 70% accuracy when provided with predicate lists, a 15–20% gain over baselines. Crucially, encoder-decoder architectures (e.g., T5 variants) exhibit stronger logical generalization than decoder-only models, with robust cross-dataset performance on MALLS, Willow, and FOLIO. These findings point to a scalable, rigorously evaluated, LLM-driven paradigm for FOL formalization.
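The predicate-conditioning idea can be illustrated with a minimal prompt builder. This is a sketch only: the template wording, the `Name/arity` notation, and the example predicates are assumptions for illustration, not the paper's actual prompt format.

```python
# Hedged sketch of predicate conditioning: prepend an explicit predicate
# inventory to the NL-to-FOL translation prompt so the model does not have
# to invent predicate names. Template and notation are illustrative.

def build_prompt(sentence, predicates):
    """Compose an NL-to-FOL prompt conditioned on a known predicate list.

    `predicates` is a list of (name, arity) pairs, rendered as Name/arity.
    """
    inventory = ", ".join(f"{name}/{arity}" for name, arity in predicates)
    return (
        "Translate the sentence into first-order logic.\n"
        f"Available predicates: {inventory}\n"
        f"Sentence: {sentence}\n"
        "FOL:"
    )

prompt = build_prompt(
    "All birds can fly.",
    [("Bird", 1), ("CanFly", 1)],
)
```

Conditioning on the inventory turns open-ended predicate invention into selection from a fixed vocabulary, which is consistent with the reported 15–20% gain when predicate lists are available.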
📝 Abstract
Automating the translation of natural language into first-order logic (FOL) is crucial for knowledge representation and formal methods, yet remains challenging. We present a systematic evaluation of fine-tuned LLMs for this task, comparing architectures (encoder-decoder vs. decoder-only) and training strategies. Using the MALLS and Willow datasets, we explore techniques such as vocabulary extension, predicate conditioning, and multilingual training, and introduce metrics for exact match, logical equivalence, and predicate alignment. Our fine-tuned Flan-T5-XXL achieves 70% accuracy when given predicate lists, outperforming GPT-4o, the chain-of-thought reasoning model DeepSeek-R1-0528, and symbolic systems such as ccg2lambda. Key findings: (1) predicate availability boosts performance by 15–20%, (2) T5 models surpass larger decoder-only LLMs, and (3) models generalize to unseen logical arguments (the FOLIO dataset) without task-specific training. While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.
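The logical-equivalence metric mentioned above can be approximated in a few lines. The following is a minimal sketch, not the paper's actual checker: it assumes formulas are encoded as Python callables over an interpretation environment, and tests equivalence exhaustively over a small finite domain (a sound proxy only up to that domain size).

```python
from itertools import product

# Hedged sketch: compare two FOL formulas by evaluating them under every
# interpretation of their predicates over a tiny finite universe. This is
# an illustrative proxy, not the paper's evaluation implementation.

DOMAIN = (0, 1)  # tiny universe; larger domains distinguish more formulas

def interpretations(arity):
    """Yield every truth assignment for a predicate of the given arity."""
    tuples = list(product(DOMAIN, repeat=arity))
    for bits in product((False, True), repeat=len(tuples)):
        yield dict(zip(tuples, bits))

def equivalent(f, g, arities):
    """True if f and g agree under all predicate interpretations over DOMAIN.

    `arities` maps predicate name -> arity; formulas are callables taking an
    environment of the form {name: {args_tuple: bool}}.
    """
    names = sorted(arities)
    spaces = [list(interpretations(arities[n])) for n in names]
    return all(
        f(dict(zip(names, combo))) == g(dict(zip(names, combo)))
        for combo in product(*spaces)
    )

# Forall x (P(x) -> Q(x)) vs Forall x (not P(x) or Q(x)):
# syntactically different, semantically equivalent.
f = lambda e: all((not e["P"][(x,)]) or e["Q"][(x,)] for x in DOMAIN)
g = lambda e: all(e["Q"][(x,)] if e["P"][(x,)] else True for x in DOMAIN)

# Exists x P(x) vs Forall x P(x): not equivalent on a two-element domain.
h = lambda e: any(e["P"][(x,)] for x in DOMAIN)
k = lambda e: all(e["P"][(x,)] for x in DOMAIN)
```

A check along these lines rewards translations that differ from the gold formula only syntactically, which exact-match scoring would penalize; the predicate-alignment axis would then separately score whether the right predicate names were used.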