GLoRE: Evaluating Logical Reasoning of Large Language Models

📅 2023-10-13
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Logical reasoning remains a critical yet under-evaluated dimension of natural language understanding, with few standardized benchmarks or unified evaluation platforms. To address this gap, the paper introduces GLoRE, a General Logical Reasoning Evaluation platform that consolidates diverse logical reasoning datasets and standardizes them into a unified format supporting both zero-shot and few-shot evaluation. GLoRE is designed as a living project that continuously integrates new datasets and models, enabling comparative assessment of both commercial and open-source (Hugging Face) models. Empirical results show that large reasoning models such as OpenAI's o1 mini, DeepSeek R1, and QwQ-32B have markedly improved logical reasoning relative to human performance and supervised fine-tuned baselines, with QwQ-32B achieving the highest benchmark performance to date. The platform is open-sourced.
📝 Abstract
Large language models (LLMs) have shown significant general language understanding abilities. However, there has been a scarcity of attempts to assess the logical reasoning capacities of these LLMs, an essential facet of natural language understanding. To encourage further investigation in this area, we introduce GLoRE, a General Logical Reasoning Evaluation platform that not only consolidates diverse datasets but also standardizes them into a unified format suitable for evaluating large language models across zero-shot and few-shot scenarios. Our experimental results show that, compared to the performance of humans and supervised fine-tuned models, the logical reasoning capabilities of large reasoning models, such as OpenAI's o1 mini, DeepSeek R1, and QwQ-32B, have seen remarkable improvements, with QwQ-32B achieving the highest benchmark performance to date. GLoRE is designed as a living project that continuously integrates new datasets and models, facilitating robust and comparative assessments of model performance across both commercial and Hugging Face communities.
Problem

Research questions and friction points this paper is trying to address.

Evaluating logical reasoning in large language models
Standardizing diverse datasets for unified assessment
Comparing model performance across zero-shot and few-shot scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

GLoRE standardizes diverse datasets for evaluation
Evaluates LLMs in zero-shot and few-shot scenarios
Continuously integrates new datasets and models
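The unified-format and zero-/few-shot protocol described above can be sketched as follows. This is a minimal illustrative example, not GLoRE's actual schema: the record fields and the prompt template are assumptions for the sake of showing how heterogeneous multiple-choice reasoning datasets might be normalized and prompted.

```python
# Hypothetical sketch: a unified record for multiple-choice logical reasoning
# items, plus prompt builders for zero-shot and few-shot evaluation.
# Field names and the template are illustrative assumptions only.

from dataclasses import dataclass


@dataclass
class ReasoningExample:
    context: str         # passage the question reasons over
    question: str
    options: list[str]   # candidate answers (multiple choice)
    answer: str          # gold label, e.g. "A"


def format_example(ex: ReasoningExample, with_answer: bool = False) -> str:
    # Label options A, B, C, ... regardless of the source dataset's format.
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(ex.options))
    prompt = f"Passage: {ex.context}\nQuestion: {ex.question}\n{labeled}\nAnswer:"
    return f"{prompt} {ex.answer}" if with_answer else prompt


def build_prompt(target: ReasoningExample, shots=()) -> str:
    # Zero-shot when `shots` is empty; few-shot prepends solved demonstrations.
    demos = [format_example(s, with_answer=True) for s in shots]
    return "\n\n".join([*demos, format_example(target)])
```

Under this kind of normalization, the same scoring code can run over every dataset, and switching between zero-shot and few-shot evaluation is just a matter of passing demonstration examples.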