GLoRE: Evaluating Logical Reasoning of Large Language Models

📅 2023-10-13
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Logical reasoning remains a critical yet under-evaluated dimension of natural language understanding, with few standardized benchmarks or unified evaluation platforms. To address this gap, the paper introduces GLoRE, a General Logical Reasoning Evaluation platform that consolidates diverse logical reasoning datasets and standardizes them into a unified format supporting both zero-shot and few-shot evaluation. GLoRE is designed as a living project that continuously integrates new datasets and models, enabling comparative assessment of both commercial and open-source (Hugging Face) models. Empirical results show that large reasoning models such as OpenAI's o1 mini, DeepSeek R1, and QwQ-32B have markedly improved logical reasoning relative to human performance and supervised fine-tuned baselines, with QwQ-32B achieving the highest benchmark performance to date. The platform is open-sourced.
📝 Abstract
Large language models (LLMs) have shown significant general language understanding abilities. However, there has been a scarcity of attempts to assess the logical reasoning capacities of these LLMs, an essential facet of natural language understanding. To encourage further investigation in this area, we introduce GLoRE, a General Logical Reasoning Evaluation platform that not only consolidates diverse datasets but also standardizes them into a unified format suitable for evaluating large language models across zero-shot and few-shot scenarios. Our experimental results show that, compared to the performance of humans and supervised fine-tuned models, the logical reasoning capabilities of large reasoning models, such as OpenAI's o1 mini, DeepSeek R1, and QwQ-32B, have seen remarkable improvements, with QwQ-32B achieving the highest benchmark performance to date. GLoRE is designed as a living project that continuously integrates new datasets and models, facilitating robust and comparative assessments of model performance across both commercial and Hugging Face communities.
Problem

Research questions and friction points this paper is trying to address.

Evaluating logical reasoning in large language models
Standardizing diverse datasets for unified assessment
Comparing model performance across zero-shot and few-shot scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

GLoRE standardizes diverse datasets for evaluation
Evaluates LLMs in zero-shot and few-shot scenarios
Continuously integrates new datasets and models
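The unified-format and zero-/few-shot protocol described above can be sketched as follows. This is a minimal illustrative example, not GLoRE's actual schema: the record fields and the prompt template are assumptions for the sake of showing how heterogeneous multiple-choice reasoning datasets might be normalized and prompted.

```python
# Hypothetical sketch: a unified record for multiple-choice logical reasoning
# items, plus prompt builders for zero-shot and few-shot evaluation.
# Field names and the template are illustrative assumptions only.

from dataclasses import dataclass


@dataclass
class ReasoningExample:
    context: str         # passage the question reasons over
    question: str
    options: list[str]   # candidate answers (multiple choice)
    answer: str          # gold label, e.g. "A"


def format_example(ex: ReasoningExample, with_answer: bool = False) -> str:
    # Label options A, B, C, ... regardless of the source dataset's format.
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(ex.options))
    prompt = f"Passage: {ex.context}\nQuestion: {ex.question}\n{labeled}\nAnswer:"
    return f"{prompt} {ex.answer}" if with_answer else prompt


def build_prompt(target: ReasoningExample, shots=()) -> str:
    # Zero-shot when `shots` is empty; few-shot prepends solved demonstrations.
    demos = [format_example(s, with_answer=True) for s in shots]
    return "\n\n".join([*demos, format_example(target)])
```

Under this kind of normalization, the same scoring code can run over every dataset, and switching between zero-shot and few-shot evaluation is just a matter of passing demonstration examples.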