🤖 AI Summary
Large language models (LLMs) excel at reasoning over unstructured text but struggle to integrate structured external knowledge such as knowledge graphs, code, and formal logic, and the field lacks a unified benchmark for evaluating reasoning across these knowledge modalities. Method: We introduce OneEval, the first benchmark for knowledge-intensive reasoning across four knowledge modalities (unstructured text, knowledge graphs, code, and formal logic) and five domains: general knowledge, government, science, law, and programming. It features a rigorously curated, human-annotated dataset of 4,019 instances, a multidimensional evaluation protocol, and a challenging subset, OneEval_Hard, designed to expose structural and reasoning complexity. Contribution/Results: Experiments on 18 state-of-the-art open-source and proprietary models reveal that even the strongest achieves only 32.2% accuracy on OneEval_Hard, that accuracy declines sharply as structural complexity increases, from 53% (text-based reasoning) to 25% (formal logic), and that longer reasoning chains yield diminishing returns. The full dataset, evaluation scripts, baseline results, and a live leaderboard are publicly released.
📝 Abstract
Large Language Models (LLMs) have demonstrated substantial progress on reasoning tasks involving unstructured text, yet their capabilities deteriorate significantly when reasoning requires integrating structured external knowledge such as knowledge graphs, code snippets, or formal logic. This limitation stems in part from the absence of benchmarks capable of systematically evaluating LLM performance across diverse structured knowledge modalities. To address this gap, we introduce OneEval, a comprehensive benchmark explicitly designed to assess the knowledge-intensive reasoning capabilities of LLMs across four knowledge modalities (unstructured text, knowledge graphs, code, and formal logic) and five critical domains (general knowledge, government, science, law, and programming). OneEval comprises 4,019 carefully curated instances and includes a challenging subset, OneEval_Hard, consisting of 1,285 particularly difficult cases. Through extensive evaluation of 18 state-of-the-art open-source and proprietary LLMs, we establish three core findings: a) *persistent limitations in structured reasoning*, with even the strongest model achieving only 32.2% accuracy on OneEval_Hard; b) *consistent performance decline as the structural complexity of the knowledge base increases*, with accuracy dropping sharply from 53% (textual reasoning) to 25% (formal logic); and c) *diminishing returns from extended reasoning chains*, highlighting the critical need for models to adapt reasoning depth to task complexity. We publicly release the OneEval datasets, evaluation scripts, and baseline results, accompanied by a leaderboard, to facilitate ongoing advances in structured knowledge reasoning.
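The headline numbers above are per-modality accuracies (e.g., 53% on text vs. 25% on formal logic). As a minimal sketch of how such a breakdown could be aggregated, the snippet below assumes a hypothetical record schema with `modality`, `prediction`, and `answer` fields; this schema and the helper name are illustrative, not the benchmark's actual evaluation code.

```python
from collections import defaultdict

def accuracy_by_modality(records):
    """Compute exact-match accuracy grouped by knowledge modality.

    records: iterable of dicts with keys 'modality', 'prediction', 'answer'
    (a hypothetical schema, for illustration only).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["modality"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["modality"]] += 1
    # Per-modality accuracy as a fraction in [0, 1]
    return {m: correct[m] / total[m] for m in total}

# Toy example with made-up predictions:
demo = [
    {"modality": "text", "prediction": "A", "answer": "A"},
    {"modality": "text", "prediction": "B", "answer": "A"},
    {"modality": "formal_logic", "prediction": "C", "answer": "D"},
]
print(accuracy_by_modality(demo))  # {'text': 0.5, 'formal_logic': 0.0}
```

Grouping with a `defaultdict` keeps the aggregation a single pass over the dataset, which scales trivially to all 4,019 instances and any number of modality labels.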