LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

📅 2026-05-19

🤖 AI Summary
Many existing logical-reasoning benchmarks for large language models are templated from sampled formulas and carry only coarse or unaudited formal annotations, so frontier reasoning models saturate them quickly and they no longer measure genuine reasoning ability. This work presents LLMEval-Logic, a Chinese logical reasoning benchmark that combines expert authoring and auditing, formal verification, and adversarial hardening: natural-language problems and their reference formalizations are expert-reviewed, annotated answers are verified with the Z3 solver, expert rubrics grade natural-to-formal translation, and a closed-loop adversarial workflow raises the difficulty of selected items. The released benchmark comprises a 246-item Base subset and a 190-item Hard subset. Across 14 frontier models, the strongest reaches only 37.5% item accuracy on the Hard subset, and the highest joint Z3+Rubric formalization score, obtained even when reference symbols are provided, is 60.16%, indicating a substantial gap in current models' capacity for rigorous logical reasoning.
📝 Abstract
Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.
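As an illustration of what verifying an annotated answer against a formalization involves, the sketch below checks entailment on a toy item: an answer is accepted only if no assignment satisfies every premise while falsifying the conclusion. This is a minimal sketch, not the benchmark's pipeline. It swaps Z3 for brute-force enumeration over a small closed space of Boolean variables, and the function `entails`, the variable names, and the toy premises are all hypothetical.

```python
from itertools import product

def entails(variables, premises, conclusion):
    """Return True if every assignment satisfying all premises
    also satisfies the conclusion (checked by exhaustive enumeration)."""
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if all(p(world) for p in premises) and not conclusion(world):
            return False  # counterexample: premises hold, conclusion fails
    return True

# Hypothetical toy item: "If it rains, the match is cancelled. It rains."
variables = ["rain", "cancelled"]
premises = [
    lambda w: (not w["rain"]) or w["cancelled"],  # rain -> cancelled
    lambda w: w["rain"],                          # it rains
]

# Annotated answer "the match is cancelled" follows from the premises.
answer_ok = entails(variables, premises, lambda w: w["cancelled"])
# The opposite answer does not follow.
answer_bad = entails(variables, premises, lambda w: not w["cancelled"])
# With only the first premise, the conclusion is underdetermined.
underdetermined = entails(variables, premises[:1], lambda w: w["cancelled"])
```

Enumeration is exponential in the number of variables, which is why a solver such as Z3 is the practical choice for real items; the logical test (premises plus negated conclusion must be unsatisfiable) is the same idea.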
Problem

Research questions and friction points this paper is trying to address.

logical reasoning
large language models
benchmark
formal verification
adversarial hardening
Innovation

Methods, ideas, or system contributions that make the work stand out.

logical reasoning
solver-verified
adversarial hardening
formalization rubric
Chinese benchmark
Ming Zhang
School of Computer Science, Fudan University
LLM
Qiyuan Peng
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University
Yinxi Wei
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University
Yujiong Shen
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University
Kexin Tan
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University
Yuhui Wang
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University
Zhenghao Xiang
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University
Junjie Ye
Fudan University
Computer Science, Natural Language Processing, Large Language Models, Tool Learning
Zhangyue Yin
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University
Zhiheng Xi
Fudan University
LLM Reasoning, LLM-based Agents
Shihan Dou
Fudan University
LLMs, Code LMs, RL, Alignment
Tao Gui
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University
Maxm Pan
Hunyuan Team, Tencent
Ruizhi Yang
School of Philosophy, Fudan University
Qi Zhang
Fudan University
SAGIN, satellite routing
Xuanjing Huang
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University