🤖 AI Summary
Large language models (LLMs) excel at reasoning over unstructured text but struggle to integrate structured external knowledge such as knowledge graphs, code, and formal logic, and the field lacks a unified benchmark for evaluating reasoning across these knowledge modalities. Method: We introduce OneEval, the first benchmark for knowledge-intensive reasoning across four knowledge modalities (unstructured text, knowledge graphs, code, and formal logic) and five domains: general knowledge, government, science, law, and programming. It features a rigorously curated, human-annotated dataset of 4,019 instances, a multidimensional evaluation protocol, and a challenging subset, OneEval_Hard, designed to expose structural and reasoning complexity. Contribution/Results: Experiments on 18 state-of-the-art open-source and proprietary models reveal that even the strongest achieves only 32.2% accuracy on OneEval_Hard, that accuracy declines sharply as structural complexity increases, from 53% (text-based reasoning) to 25% (formal logic), and that longer reasoning chains yield diminishing returns. The full dataset, evaluation scripts, baseline results, and a live leaderboard are publicly released.
📝 Abstract
Large Language Models (LLMs) have demonstrated substantial progress on reasoning tasks involving unstructured text, yet their capabilities deteriorate significantly when reasoning requires integrating structured external knowledge such as knowledge graphs, code snippets, or formal logic. This limitation stems in part from the absence of benchmarks capable of systematically evaluating LLM performance across diverse structured knowledge modalities. To address this gap, we introduce OneEval, a comprehensive benchmark explicitly designed to assess the knowledge-intensive reasoning capabilities of LLMs across four knowledge modalities (unstructured text, knowledge graphs, code, and formal logic) and five critical domains (general knowledge, government, science, law, and programming). OneEval comprises 4,019 carefully curated instances and includes a challenging subset, OneEval_Hard, consisting of 1,285 particularly difficult cases. Through extensive evaluation of 18 state-of-the-art open-source and proprietary LLMs, we establish three core findings: a) *persistent limitations in structured reasoning*, with even the strongest model achieving only 32.2% accuracy on OneEval_Hard; b) *consistent performance decline as the structural complexity of the knowledge base increases*, with accuracy dropping sharply from 53% (textual reasoning) to 25% (formal logic); and c) *diminishing returns from extended reasoning chains*, highlighting the critical need for models to adapt reasoning depth to task complexity. We publicly release the OneEval datasets, evaluation scripts, and baseline results, accompanied by a leaderboard, to facilitate ongoing advances in structured knowledge reasoning.
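The headline numbers above are per-modality accuracies (e.g., 53% on text vs. 25% on formal logic). As a minimal sketch of how such a breakdown could be aggregated, the snippet below assumes a hypothetical record schema with `modality`, `prediction`, and `answer` fields; this schema and the helper name are illustrative, not the benchmark's actual evaluation code.

```python
from collections import defaultdict

def accuracy_by_modality(records):
    """Compute exact-match accuracy grouped by knowledge modality.

    records: iterable of dicts with keys 'modality', 'prediction', 'answer'
    (a hypothetical schema, for illustration only).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["modality"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["modality"]] += 1
    # Per-modality accuracy as a fraction in [0, 1]
    return {m: correct[m] / total[m] for m in total}

# Toy example with made-up predictions:
demo = [
    {"modality": "text", "prediction": "A", "answer": "A"},
    {"modality": "text", "prediction": "B", "answer": "A"},
    {"modality": "formal_logic", "prediction": "C", "answer": "D"},
]
print(accuracy_by_modality(demo))  # {'text': 0.5, 'formal_logic': 0.0}
```

Grouping with a `defaultdict` keeps the aggregation a single pass over the dataset, which scales trivially to all 4,019 instances and any number of modality labels.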