LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

📅 2026-05-19

🤖 AI Summary

Existing Chinese benchmarks for LLM logical reasoning saturate quickly because their questions are template-generated and their annotations are coarse, so they fail to rigorously assess genuine reasoning ability. This work proposes the first Chinese logical reasoning evaluation framework that integrates expert co-authoring, formal verification, and adversarial hardening: natural-language problems and their formal representations undergo expert review, annotated answers are verified with the Z3 theorem prover, fine-grained expert rubrics grade natural-to-formal translation, and a closed-loop adversarial workflow hardens selected items. The released benchmark comprises a 246-item Base subset and a 190-item Hard subset. Evaluations of 14 frontier models show that the strongest model reaches only 37.5% accuracy on Hard items, and that the highest joint Z3+Rubric formalization score, even with reference symbols provided, is only 60.16%, underscoring a significant gap in current models' capacity for rigorous logical reasoning.

📝 Abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.
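
To make the idea of solver-verified answers concrete, here is a minimal sketch of checking an annotated label against a formalization. It is not the authors' pipeline: the paper uses Z3, whereas this sketch substitutes brute-force enumeration over a small propositional model space so that it runs with the standard library alone. The function names (`entails`, `verify_label`), the three-way label set, and the toy item are all assumptions made for illustration.

```python
from itertools import product

def entails(symbols, premises, conclusion):
    """True if every assignment that satisfies all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if all(p(model) for p in premises) and not conclusion(model):
            return False  # found a counterexample model
    return True

def verify_label(symbols, premises, conclusion, annotated):
    """Derive a label from the formalization and compare it with the annotated one.

    Hypothetical label set: "entailed", "contradicted", "unknown".
    Note: inconsistent premises entail everything, so they yield "entailed".
    """
    if entails(symbols, premises, conclusion):
        derived = "entailed"
    elif entails(symbols, premises, lambda m: not conclusion(m)):
        derived = "contradicted"
    else:
        derived = "unknown"
    return derived == annotated, derived

# Toy item (illustrative only, not taken from the benchmark):
# "If it rains, the road is wet. If the road is wet, it is slippery. It rains."
SYMBOLS = ["rain", "wet", "slippery"]
PREMISES = [
    lambda m: (not m["rain"]) or m["wet"],      # rain -> wet
    lambda m: (not m["wet"]) or m["slippery"],  # wet -> slippery
    lambda m: m["rain"],                        # rain
]

ok, derived = verify_label(SYMBOLS, PREMISES, lambda m: m["slippery"], "entailed")
print(ok, derived)
```

An annotation that disagrees with the derived label would be flagged for expert review. A real solver such as Z3 replaces the exhaustive loop with a satisfiability check on the premises plus the negated conclusion, which scales beyond small propositional spaces.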

Problem

Research questions and friction points this paper is trying to address.

logical reasoning

large language models

benchmark

formal verification

adversarial hardening

Innovation

Methods, ideas, or system contributions that make the work stand out.

logical reasoning

solver-verified

adversarial hardening