🤖 AI Summary
This work addresses a critical gap in current large language model (LLM) evaluation frameworks: their inability to assess models’ capacity to discern semantic irrelevance, which is essential for understanding true semantic boundaries. To this end, we propose CORE, a novel evaluation framework that systematically introduces matched irrelevant pairs, yielding a large-scale dataset of 225,000 multiple-choice questions spanning 74 disciplines and an open-source benchmark of 203 items covering 24 semantic relations. Through expert validation (Cohen’s Kappa = 1.0), human baseline comparisons, and calibration error analysis, we reveal that leading LLMs achieve only 0–41.35% accuracy on irrelevant pairs, far below the human baseline of 92.6%, and exhibit a mean semantic collapse rate of 37.6%. This study establishes irrelevance reasoning as a crucial new dimension for LLM evaluation and safety.
📝 Abstract
Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen's Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the full 225K-question CORE dataset, accuracy drops further to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.
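The calibration finding above (high confidence paired with low accuracy on unrelated pairs) can be made concrete with the standard Expected Calibration Error computation. This is a minimal sketch: the 10-bin equal-width scheme and the toy confidence/accuracy numbers are common conventions and illustrative assumptions, not the paper's actual protocol.

```python
# Sketch of Expected Calibration Error (ECE): bin predictions by
# confidence, then average the per-bin gap between accuracy and
# mean confidence, weighted by bin size. The 10-bin equal-width
# scheme is a common convention, assumed here, not taken from CORE.

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Assign each prediction to the bin containing its confidence
        # (bins are half-open on the left; bin 0 also catches 0.0).
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Toy illustration of the reported pattern on unrelated pairs:
# ~93% confidence but only 20% accuracy gives a very large ECE.
confs = [0.93] * 10
hits = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # 20% accuracy
print(round(expected_calibration_error(confs, hits), 2))  # 0.73
```

When confidence tracks accuracy well (e.g. 90% confidence, 90% accuracy) the same computation returns a value near zero, which is why a 2-4x ECE increase on unrelated pairs signals systematic overconfidence rather than mere difficulty.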