CORE: Comprehensive Ontological Relation Evaluation for Large Language Models

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in current large language model (LLM) evaluation frameworks: their inability to assess models’ capacity to discern semantic irrelevance, which is essential for understanding true semantic boundaries. To this end, we propose CORE, a novel evaluation framework that systematically introduces matched irrelevant pairs, yielding a large-scale dataset of 225,000 multiple-choice questions spanning 74 disciplines and an open-source benchmark of 203 items covering 24 semantic relations. Through expert validation (Cohen’s Kappa = 1.0), human baseline comparisons, and calibration error analysis, we reveal that leading LLMs achieve only 0–41.35% accuracy on irrelevant pairs, with a semantic collapse rate of 37.6%—substantially below the human baseline of 92.6%. This study establishes irrelevance reasoning as a crucial new dimension for LLM evaluation and safety.

📝 Abstract
Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen's Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the CORE 225K MCQs dataset, accuracy further drops to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.
Problem

Research questions and friction points this paper is trying to address.

semantic relations
unrelatedness reasoning
large language models
ontological evaluation
spurious relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

ontological relation evaluation
unrelatedness reasoning
semantic collapse
large language model benchmarking
calibration error
Satyam Dwivedi
Ericsson Research, Stockholm, Sweden
Positioning, Propagation, Signal Processing, Wireless experiments
Sanjukta Ghosh
IIT BHU, Varanasi
Shivam Dwivedi
IIT BHU, Varanasi
Nishi Kumari
Vaikhari AI, Bangalore
Anil Thakur
IIT BHU, Varanasi
Anurag Purushottam
Vaikhari AI, Bangalore
Deepak Alok
Indian Institute of Technology Delhi
Generative syntax and its interaction with semantics and pragmatics. Natural Language Processing.
Praveen Gatla
BHU, Varanasi
Manjuprasad B
GSSSIETW, Mysore
Bipasha Patgiri
Tezpur University, Assam