DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

📅 2025-09-19
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing logical reasoning benchmarks suffer from insufficient linguistic diversity and distributional bias, leading to distorted model evaluations. To address this, we propose DivLogicEval—the first debiased evaluation framework for classical logical reasoning. Our approach (1) generates counterintuitive, linguistically diverse natural-language instances grounded in formal logic, mitigating the entanglement of logical reasoning with other skills; (2) introduces a decoupled evaluation metric that explicitly controls for logical dependency strength; and (3) validates its effectiveness via controlled-variable experiments. Empirical results demonstrate that DivLogicEval reliably discriminates among state-of-the-art large language models on pure logical reasoning capability, uncovering persistent weaknesses under semantically neutral conditions. By improving linguistic coverage and reducing confounding biases, DivLogicEval establishes a more reliable, comprehensive benchmark for assessing foundational logical reasoning competence.
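
The summary does not spell out how the decoupled, debiased metric is computed. As an illustration only, the sketch below shows one standard way to suppress option-position bias and guessing randomness in multiple-choice scoring: credit a question only when the model answers correctly under every cyclic rotation of the options. The function names and the scoring rule are assumptions for illustration, not the paper's actual metric.

```python
from collections import deque
from typing import Callable

# A "model" here is any callable mapping (question, options) to a chosen index.
Model = Callable[[str, list[str]], int]

def permutation_consistent_score(model: Model, question: str,
                                 options: list[str], gold: int) -> float:
    """Credit the question only if `model` picks the gold answer under
    every cyclic rotation of the options; otherwise score 0."""
    opts = deque(options)
    for shift in range(len(options)):
        gold_pos = (gold - shift) % len(options)  # gold's index after `shift` left rotations
        if model(question, list(opts)) != gold_pos:
            return 0.0
        opts.rotate(-1)  # rotate left: [a, b, c] -> [b, c, a]
    return 1.0

# A degenerate model that always picks the first option would hit ~25% by
# luck on fixed 4-option questions, but scores 0 under rotation:
biased: Model = lambda q, opts: 0
print(permutation_consistent_score(
    biased, "P -> Q; P. Therefore?", ["Q", "not Q", "P and not Q", "unknown"], 0))
# -> 0.0
```

Averaging such per-question scores over the benchmark yields an accuracy far less sensitive to where the correct option happens to appear.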

📝 Abstract
Logical reasoning in natural language has been recognized as an important measure of intelligence for Large Language Models (LLMs). However, popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations of the logical reasoning skill alone. Meanwhile, existing logical reasoning benchmarks are limited in language diversity, and their distributions deviate from that of an ideal logical reasoning benchmark, which may lead to biased evaluation results. This paper therefore proposes DivLogicEval, a new classical logic benchmark consisting of natural sentences that compose diverse statements in counterintuitive ways. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of the bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of popular LLMs at this task.
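
The abstract gives no sample item; to make "diverse statements composed in counterintuitive ways" concrete, here is a hedged sketch of the underlying idea: a classically valid inference whose surface content contradicts world knowledge, verified by brute-force truth-table enumeration. The sentences and helper names are invented for illustration and are not drawn from the benchmark itself.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """Propositional entailment by truth-table enumeration: the conclusion
    must hold in every assignment that satisfies all premises."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False
    return True

# Modus ponens dressed in counterintuitive content: "If fire is cold, then
# ice burns" plus "fire is cold" forces the intuitively false "ice burns".
# Answering correctly requires following the logic, not world knowledge.
premises = [
    lambda e: (not e["fire_cold"]) or e["ice_burns"],  # fire_cold -> ice_burns
    lambda e: e["fire_cold"],
]
conclusion = lambda e: e["ice_burns"]
print(entails(premises, conclusion, ["fire_cold", "ice_burns"]))  # True
```

Because the valid conclusion conflicts with a model's prior beliefs, such items separate genuine deduction from plausibility-based shortcut answering.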
Problem

Research questions and friction points this paper is trying to address.

Evaluating logical reasoning in LLMs with diverse natural language
Addressing bias from limited language diversity in benchmarks
Introducing a new metric to reduce bias and randomness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Classical logic benchmark with counterintuitive statements
New evaluation metric reducing bias and randomness
Diverse natural language sentences for reliable testing