🤖 AI Summary
Translating natural language to first-order logic (FOL) remains a fundamental challenge in knowledge representation and formal reasoning. This work systematically evaluates large language models (LLMs) on the task, introducing two strategies: predicate conditioning, which supplies the model with an explicit predicate inventory, and multilingual joint training. It further proposes a dual-axis evaluation framework grounded in logical equivalence and predicate alignment. Experiments show that a fine-tuned Flan-T5-XXL substantially outperforms both GPT-4o and the symbolic system ccg2lambda, reaching 70% accuracy when provided with predicate lists, a 15–20% gain over baselines. Crucially, encoder-decoder architectures (e.g., T5 variants) exhibit stronger logical generalization than decoder-only models, with robust cross-dataset performance on MALLS, Willow, and FOLIO. These findings point to a scalable, rigorously evaluated, LLM-driven paradigm for FOL formalization.
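The predicate-conditioning idea can be illustrated with a minimal prompt builder. This is a sketch only: the template wording, the `Name/arity` notation, and the example predicates are assumptions for illustration, not the paper's actual prompt format.

```python
# Hedged sketch of predicate conditioning: prepend an explicit predicate
# inventory to the NL-to-FOL translation prompt so the model does not have
# to invent predicate names. Template and notation are illustrative.

def build_prompt(sentence, predicates):
    """Compose an NL-to-FOL prompt conditioned on a known predicate list.

    `predicates` is a list of (name, arity) pairs, rendered as Name/arity.
    """
    inventory = ", ".join(f"{name}/{arity}" for name, arity in predicates)
    return (
        "Translate the sentence into first-order logic.\n"
        f"Available predicates: {inventory}\n"
        f"Sentence: {sentence}\n"
        "FOL:"
    )

prompt = build_prompt(
    "All birds can fly.",
    [("Bird", 1), ("CanFly", 1)],
)
```

Conditioning on the inventory turns open-ended predicate invention into selection from a fixed vocabulary, which is consistent with the reported 15–20% gain when predicate lists are available.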
📝 Abstract
Automating the translation of natural language into first-order logic (FOL) is crucial for knowledge representation and formal methods, yet remains challenging. We present a systematic evaluation of fine-tuned LLMs for this task, comparing architectures (encoder-decoder vs. decoder-only) and training strategies. Using the MALLS and Willow datasets, we explore techniques such as vocabulary extension, predicate conditioning, and multilingual training, and introduce metrics for exact match, logical equivalence, and predicate alignment. Our fine-tuned Flan-T5-XXL achieves 70% accuracy when given predicate lists, outperforming GPT-4o, the chain-of-thought reasoning model DeepSeek-R1-0528, and symbolic systems such as ccg2lambda. Key findings: (1) predicate availability boosts performance by 15–20%, (2) T5 models surpass larger decoder-only LLMs, and (3) models generalize to unseen logical arguments (the FOLIO dataset) without task-specific training. While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.
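The logical-equivalence metric mentioned above can be approximated in a few lines. The following is a minimal sketch, not the paper's actual checker: it assumes formulas are encoded as Python callables over an interpretation environment, and tests equivalence exhaustively over a small finite domain (a sound proxy only up to that domain size).

```python
from itertools import product

# Hedged sketch: compare two FOL formulas by evaluating them under every
# interpretation of their predicates over a tiny finite universe. This is
# an illustrative proxy, not the paper's evaluation implementation.

DOMAIN = (0, 1)  # tiny universe; larger domains distinguish more formulas

def interpretations(arity):
    """Yield every truth assignment for a predicate of the given arity."""
    tuples = list(product(DOMAIN, repeat=arity))
    for bits in product((False, True), repeat=len(tuples)):
        yield dict(zip(tuples, bits))

def equivalent(f, g, arities):
    """True if f and g agree under all predicate interpretations over DOMAIN.

    `arities` maps predicate name -> arity; formulas are callables taking an
    environment of the form {name: {args_tuple: bool}}.
    """
    names = sorted(arities)
    spaces = [list(interpretations(arities[n])) for n in names]
    return all(
        f(dict(zip(names, combo))) == g(dict(zip(names, combo)))
        for combo in product(*spaces)
    )

# Forall x (P(x) -> Q(x)) vs Forall x (not P(x) or Q(x)):
# syntactically different, semantically equivalent.
f = lambda e: all((not e["P"][(x,)]) or e["Q"][(x,)] for x in DOMAIN)
g = lambda e: all(e["Q"][(x,)] if e["P"][(x,)] else True for x in DOMAIN)

# Exists x P(x) vs Forall x P(x): not equivalent on a two-element domain.
h = lambda e: any(e["P"][(x,)] for x in DOMAIN)
k = lambda e: all(e["P"][(x,)] for x in DOMAIN)
```

A check along these lines rewards translations that differ from the gold formula only syntactically, which exact-match scoring would penalize; the predicate-alignment axis would then separately score whether the right predicate names were used.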