🤖 AI Summary
Existing evaluations of large language models (LLMs) on natural-language-to-first-order-logic (NL-FOL) translation yield inconsistent conclusions, primarily because conventional metrics conflate genuine logical understanding with superficial pattern matching. Method: We propose a novel evaluation paradigm featuring a controllable benchmark, comparative experiments between embedding-centric and dialogue-oriented LLMs, and a rigorous variable-control protocol that systematically isolates deep logical reasoning from data-memorization effects. Contribution/Results: Our experiments demonstrate that state-of-the-art conversational LLMs achieve high accuracy on sentence-level NL-FOL translation and exhibit genuine mastery of logical semantics. This work exposes critical limitations in prevailing evaluation frameworks and establishes a reproducible, decomposable methodology for assessing LLMs' formal logical capabilities, advancing both diagnostic rigor and theoretical interpretability in neuro-symbolic reasoning research.
📝 Abstract
Due to its expressiveness and unambiguous nature, First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL). This is useful, e.g., for specifying and verifying desired system properties. While translating FOL into human-readable English is relatively straightforward, the inverse problem, converting NL to FOL (NL-FOL translation), has remained a longstanding challenge for both humans and machines. Although the emergence of Large Language Models (LLMs) promised a breakthrough, recent literature provides conflicting results on their ability to perform NL-FOL translation. In this work, we provide a threefold contribution. First, we critically examine existing datasets and protocols for evaluating NL-FOL translation performance, revealing key limitations that may cause a misrepresentation of LLMs' actual capabilities. Second, to overcome these shortcomings, we propose a novel evaluation protocol explicitly designed to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Third, using this new approach, we show that state-of-the-art, dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.
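To illustrate the NL-FOL translation task described above (this is our own textbook-style example, not one drawn from the paper's benchmark; the predicate names are assumed), consider the English sentence "Every student passed some exam":

```latex
% NL: "Every student passed some exam."
% One FOL reading (surface scope; Student, Exam, Passed are assumed predicate names):
\forall x \,\bigl(\mathrm{Student}(x) \rightarrow
    \exists y \,(\mathrm{Exam}(y) \land \mathrm{Passed}(x, y))\bigr)
```

Note the quantifier-scope ambiguity: placing the existential outermost, $\exists y\,(\mathrm{Exam}(y) \land \forall x\,(\mathrm{Student}(x) \rightarrow \mathrm{Passed}(x, y)))$, instead asserts that a single exam was passed by all students. Resolving such ambiguities while committing to exactly one unambiguous formula is part of what makes NL-FOL translation hard.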