🤖 AI Summary
Existing Chinese logical reasoning benchmarks for large language models saturate quickly because their questions are template-generated and their annotations are coarse, so they fail to rigorously assess genuine reasoning ability. This work proposes the first Chinese logical reasoning evaluation framework that integrates expert co-authoring, formal verification, and adversarial hardening: experts author and review natural-language problems together with their formal representations, answers are validated with the Z3 theorem prover, fine-grained scoring rubrics are designed for grading formalizations, and a closed-loop adversarial refinement process raises item difficulty. The released benchmark comprises 246 Base (foundational) and 190 Hard (challenging) problems. Evaluations of 14 state-of-the-art models show that even the strongest model achieves only 37.5% accuracy on Hard problems, and the highest formal score, which combines Z3 verification and rubric assessment, reaches only 60.16%, underscoring a significant gap in current models' capacity for rigorous logical reasoning.
📝 Abstract
Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.
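To illustrate what verifying an annotated answer against a formalization involves, the following is a minimal sketch of an entailment check over a closed model space. It is not the LLMEval-Logic pipeline: the toy puzzle, the variable names, and the helpers `satisfiable` and `entails` are hypothetical, and brute-force enumeration in plain Python stands in for Z3 so that the example has no dependencies. With a solver such as Z3, the same question is usually posed as asking whether the premises together with the negated conclusion are unsatisfiable.

```python
from itertools import product


def models(variables, premises):
    """Yield every truth assignment over `variables` that satisfies all premises."""
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(premise(model) for premise in premises):
            yield model


def satisfiable(variables, premises):
    """Return True if at least one assignment satisfies all premises."""
    return any(True for _ in models(variables, premises))


def entails(variables, premises, conclusion):
    """Return True if the conclusion holds in every model of the premises."""
    return all(conclusion(model) for model in models(variables, premises))


# Hypothetical toy item (not taken from the benchmark):
# exactly one of A, B, C is guilty; if A is guilty then B is guilty; C is innocent.
variables = ["a", "b", "c"]
premises = [
    lambda m: sum(m[v] for v in variables) == 1,  # exactly one is guilty
    lambda m: (not m["a"]) or m["b"],             # A guilty implies B guilty
    lambda m: not m["c"],                         # C is innocent
]

# Check consistency first: entailment from contradictory premises is vacuous.
premises_consistent = satisfiable(variables, premises)

# The annotated answer "B is guilty" should follow strictly from the premises,
# while the distractor "A is guilty" should not.
answer_verified = entails(variables, premises, lambda m: m["b"])
distractor_verified = entails(variables, premises, lambda m: m["a"])

print(premises_consistent, answer_verified, distractor_verified)
```

Enumeration is only feasible for very small propositional items; a solver-backed check is the practical choice once formalizations involve many variables, quantifiers, or arithmetic.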