OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases

📅 2025-06-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) excel at unstructured text reasoning but exhibit significant limitations in integrating structured external knowledge—such as knowledge graphs, code, and formal logic—and lack unified, cross-modal evaluation benchmarks. Method: We introduce OneEval, the first knowledge-intensive multimodal reasoning benchmark, covering four structured modalities (text, knowledge graphs, code, and formal logic) across five domains: general, governmental, scientific, legal, and programming. It features a rigorously curated, human-annotated dataset of 4,019 instances, a multidimensional evaluation protocol, and a challenging subset—OneEval_Hard—designed to expose structural and reasoning complexity. Contribution/Results: Experiments reveal that state-of-the-art models achieve only 32.2% accuracy on OneEval_Hard, with formal logic reasoning (25%) markedly underperforming text-based reasoning (53%). We systematically identify nonlinear degradation in performance with increasing structural complexity and reasoning chain length. The full dataset, evaluation scripts, baseline results, and a live leaderboard are publicly released.

📝 Abstract
Large Language Models (LLMs) have demonstrated substantial progress on reasoning tasks involving unstructured text, yet their capabilities significantly deteriorate when reasoning requires integrating structured external knowledge such as knowledge graphs, code snippets, or formal logic. This limitation is partly due to the absence of benchmarks capable of systematically evaluating LLM performance across diverse structured knowledge modalities. To address this gap, we introduce OneEval, a comprehensive benchmark explicitly designed to assess the knowledge-intensive reasoning capabilities of LLMs across four knowledge modalities (unstructured text, knowledge graphs, code, and formal logic) and five critical domains (general knowledge, government, science, law, and programming). OneEval comprises 4,019 carefully curated instances and includes a challenging subset, OneEval_Hard, consisting of 1,285 particularly difficult cases. Through extensive evaluation of 18 state-of-the-art open-source and proprietary LLMs, we establish three core findings: (a) persistent limitations in structured reasoning, with even the strongest model achieving only 32.2% accuracy on OneEval_Hard; (b) performance consistently declines as the structural complexity of the knowledge base increases, with accuracy dropping sharply from 53% (textual reasoning) to 25% (formal logic); and (c) diminishing returns from extended reasoning chains, highlighting the critical need for models to adapt reasoning depth appropriately to task complexity. We publicly release the OneEval datasets, evaluation scripts, and baseline results, accompanied by a leaderboard to facilitate ongoing advances in structured knowledge reasoning.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM reasoning across diverse structured knowledge bases
Evaluating performance decline with increasing knowledge structural complexity
Addressing diminishing returns from extended reasoning chains in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for structured knowledge reasoning
Evaluates LLMs across diverse knowledge modalities
Includes challenging subset for difficult cases
👥 Authors
Yongrui Chen
Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Zhiqiang Liu
Zhejiang University, China
Jing Yu
Northwestern University
Sustainability, Life Cycle Analysis, Transportation Management, Operations Research
Lin Ren
Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Nan Hu
Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Xinbang Dai
Southeast University
Question Answering, LLM
Jiajun Liu
Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Jiazhen Kang
Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Shenyu Zhang
Southeast University
Natural Language Processing
Xinda Wang
University of Texas at Dallas
Software Security, AI Security, Systems Security
Keyan Ding
Zhejiang University, China
Pengfei Shen
Nanjing University of Posts and Telecommunications, China
Haolei Zhu
Nanjing University of Posts and Telecommunications, China
Hongjie Deng
Zhejiang University, China
Yisong Wang
Tongji University, China
Tongtong Wu
Monash University, Australia
Sheng Bi
Dalian University of Technology
Semiconductor, Organic Electronics
Wen Zhang
Zhejiang University, China
Tianxing Wu
Ph.D. Student, Nanyang Technological University
Computer Vision
Qiu Ji
Nanjing University of Posts and Telecommunications, China
Haofen Wang
Tongji University
Knowledge Graph, Natural Language Processing, Retrieval Augmented Generation
Wenliang Chen
Soochow University, China
Huajun Chen
Zhejiang University, China
Guilin Qi
Southeast University
Artificial Intelligence, Ontology