Assessing the Alignment of FOL Closeness Metrics with Human Judgement

📅 2025-01-15
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the misalignment between automated first-order logic (FOL) similarity metrics and human judgments. We systematically evaluate mainstream metrics, including BLEU, Smatch++, and FOL-specific measures, under three formal perturbation types: operator-level, structural, and textual. Our analysis reveals a pervasive over-sensitivity issue across existing metrics. BertScore achieves the highest consistency with human rankings, attaining the best Kendall's τ among the metrics evaluated. To mitigate over-sensitivity while preserving discriminative power, we propose a novel multi-metric weighted fusion framework. This approach significantly improves alignment with human judgment, raising Spearman's ρ by 12.7% over every individual metric. To our knowledge, this is the first empirically validated, reliability- and robustness-aware evaluation benchmark for LLM-driven logical reasoning assessment.
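The weighted fusion and rank-alignment check described above can be sketched in a few lines. Everything below is illustrative: the metric names, scores, and weights are hypothetical placeholders, not the paper's actual data or learned configuration.

```python
# Illustrative sketch: fuse several metric scores with fixed weights,
# then check rank alignment with (hypothetical) human preference scores
# via Spearman's rho. Pure-Python, no-ties version for simplicity.

def fuse_scores(metric_scores, weights):
    """Weighted sum of per-candidate scores across metrics."""
    n = len(next(iter(metric_scores.values())))
    return [
        sum(weights[m] * scores[i] for m, scores in metric_scores.items())
        for i in range(n)
    ]

def spearman_rho(x, y):
    """Spearman's rank correlation, assuming no tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for four FOL candidates under three metrics.
metric_scores = {
    "bleu":      [0.90, 0.40, 0.70, 0.20],
    "smatchpp":  [0.80, 0.50, 0.60, 0.30],
    "bertscore": [0.95, 0.50, 0.80, 0.30],
}
weights = {"bleu": 0.2, "smatchpp": 0.3, "bertscore": 0.5}  # assumed, not learned
human = [1.0, 0.4, 0.7, 0.1]  # hypothetical human preference scores

fused = fuse_scores(metric_scores, weights)
print(round(spearman_rho(fused, human), 3))  # → 1.0 (perfect rank agreement here)
```

In the paper the fusion weights are tuned to trade off sensitivity against human alignment; here they are fixed by hand purely to show the mechanics.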

📝 Abstract
The recently successful paradigm of solving logical reasoning problems with tool-augmented large language models (LLMs) translates natural language statements into First-Order Logic (FOL) and delegates proof search to external theorem provers. However, the correctness of the generated FOL statements, which comprise logical operators and text predicates, often goes unverified because no reliable metric exists for comparing generated and ground-truth FOLs. In this paper, we present a comprehensive study of the sensitivity of existing metrics and of their alignment with human judgement on FOL evaluation. We carefully design perturbations of ground-truth FOLs to assess metric sensitivity, and we sample FOL translation candidates for natural language statements to measure the ranking alignment between automatic metrics and human annotators. Our empirical findings highlight over-sensitivity of the n-gram metric BLEU to textual perturbations, of the semantic-graph metric Smatch++ to structural perturbations, and of the FOL-specific metric to operator perturbations. We also observe that BertScore aligns most closely with human judgement. Finally, we show that combining metrics improves both alignment and sensitivity compared with any individual metric.
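The three perturbation families can be illustrated on a toy FOL string. These simple string edits are stand-ins for the paper's formally defined perturbation operators, and the example formula is invented for illustration.

```python
# Toy illustration of the three perturbation types applied to a
# ground-truth FOL formula (invented example, not from the paper's data).

gold = "∀x (Dog(x) ∧ Barks(x) → Animal(x))"

# Operator-level: change a logical connective (first ∧ becomes ∨).
operator_perturbed = gold.replace("∧", "∨", 1)

# Structural: reorder the conjuncts on the left of the implication.
structural_perturbed = "∀x (Barks(x) ∧ Dog(x) → Animal(x))"

# Textual: rename a text predicate without touching the logical structure.
textual_perturbed = gold.replace("Barks", "MakesNoise")

for name, fol in [("operator", operator_perturbed),
                  ("structural", structural_perturbed),
                  ("textual", textual_perturbed)]:
    print(f"{name:>10}: {fol}")
```

The paper's finding is that each metric family over-penalizes one of these: BLEU the textual edit, Smatch++ the structural one, and the FOL-specific metric the operator change, even when the perturbed formula stays close (or logically equivalent) to the gold one.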
Problem

Research questions and friction points this paper is trying to address.

First-Order Logic
Similarity Calculation
Human Judgment Consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural Language Processing
BertScore
First-Order Logic Evaluation