From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multiple-choice reasoning benchmarks no longer evaluate the complex reasoning capabilities of large language models reliably, because models saturate them quickly and static test sets suffer from data contamination. This work proposes LogiHard, a framework that raises logical structural complexity by deterministically transforming zero-order multiple-choice questions into second-order logical judgment problems through a combinatorial hardening mechanism. Using item response theory (IRT), computerized adaptive testing (CAT), and a 9-dimensional cognitive analysis, the authors construct the LogiHard-2k dataset, in which added difficulty comes from logical structure rather than surface complexity and validity remains verifiable. Evaluation across twelve state-of-the-art models reveals accuracy drops of 31%–56%, and zero-shot transfer to MMLU produces a 47-percentage-point accuracy drop (89.84% to 42.86%), exposing systematic deficiencies in compositional reasoning and supporting the framework's cross-domain validity and assessment precision.

📝 Abstract

Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, and they fail to challenge advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the reasoning overhead and the number of reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs exhibit multi-select failure and early-exit bias, neither of which is shared by human test takers. Zero-shot transfer to MMLU shows a 47-percentage-point accuracy drop (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degradation is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.
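
To make the IRT and CAT step more concrete, the following is a minimal sketch of adaptive item selection, not the authors' implementation. It assumes a two-parameter logistic (2PL) model; the abstract does not say which IRT model LogiHard uses. The function names, the item parameters `a` (discrimination) and `b` (difficulty), and the gradient-ascent ability update are all illustrative assumptions.

```python
import math

# Assumption: 2PL IRT model. Each item is a tuple (a, b) with
# a = discrimination and b = difficulty. All names are hypothetical.

def p_correct(theta, a, b):
    """Probability that a test taker with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, items, asked):
    """Index of the not-yet-asked item that is most informative at theta."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))

def update_theta(theta, responses, items, lr=0.5, steps=50):
    """Gradient ascent on the 2PL log-likelihood, clamped to [-4, 4].

    responses is a list of (item_index, y) pairs with y = 1 for correct
    and y = 0 for incorrect.
    """
    for _ in range(steps):
        grad = sum(items[i][0] * (y - p_correct(theta, *items[i]))
                   for i, y in responses)
        theta = max(-4.0, min(4.0, theta + lr * grad))
    return theta

# Toy item bank with made-up parameters: easy, medium, hard.
items = [(1.0, -2.0), (1.5, 0.0), (1.0, 2.0)]
theta = 0.0
first = pick_next_item(theta, items, asked=set())
# Suppose the model answers the first selected item correctly.
theta = update_theta(theta, [(first, 1)], items)
second = pick_next_item(theta, items, asked={first})
```

Selecting the item with maximal information at the current ability estimate is one common CAT strategy. It illustrates how an adaptive test can locate a model's ability with fewer questions than a fixed benchmark, which is the benefit the abstract attributes to combining IRT with CAT.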

Problem

Research questions and friction points this paper is trying to address.

combinatorial reasoning

logical judgment

multiple-choice reasoning

reasoning gap

compositional failures

Innovation

Methods, ideas, or system contributions that make the work stand out.

combinatorial hardening

logical reasoning

Item Response Theory