🤖 AI Summary
Existing evaluation methods struggle to disentangle domain knowledge from reasoning, leading to biased measurements of pure reasoning capability. To address this, we propose the Knowledge-Orthogonal Reasoning (KOR) paradigm and introduce KOR-Bench—the first benchmark explicitly designed to assess reasoning in isolation from domain-specific knowledge. KOR-Bench comprises five task categories: operations, logic, ciphers, puzzles, and counterfactuals, evaluated via a rule-driven framework enabling dynamic rule injection and cross-task generalization analysis. Methodologically, we integrate Stepwise Prompting, two-round Self-Correction, rule-focused attention visualization, and multi-scale few-shot evaluation. Experiments show that O1-Preview and O1-Mini achieve 72.88% and 70.16% accuracy, respectively—significantly outperforming Claude-3.5-Sonnet (58.96%) and GPT-4o (58.00%). Notably, cipher tasks reveal a critical bottleneck across current models, underscoring the need for improved symbolic and constraint-based reasoning.
📝 Abstract
In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), a concept aimed at minimizing reliance on domain-specific knowledge to enable more accurate evaluation of models' reasoning abilities in out-of-distribution settings. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes models' effectiveness in applying new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, surpassing Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%), highlighting the effectiveness of KOR-Bench. We perform detailed analyses, identifying bottlenecks in the Cipher task with Stepwise Prompting, where two rounds of Self-Correction yield optimal results. We evaluate performance across three integrated tasks, explore the impact of Tricks on the Puzzle task, and visualize rule-focused attention. Additionally, we conduct ablation studies on dataset size, benchmark correlations, and zero-shot and three-shot "only questions" experiments. KOR-Bench aims to enhance reasoning evaluation and support further research in this area.
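To make the rule-driven evaluation idea concrete, here is a minimal sketch (not from the paper; the rule, function names, and prompt wording are invented for illustration) of how a freshly defined rule can be injected into a zero-shot prompt while a reference implementation of the same rule scores the model's answer:

```python
# Hypothetical sketch of knowledge-orthogonal, rule-driven evaluation:
# a newly invented rule is given to the model in the prompt, and a
# ground-truth implementation of that same rule checks the answer.
# No pretraining knowledge is needed to solve the task, only the rule.

RULE = (
    "Rule: To encode a word, reverse it, then replace each letter "
    "with the next letter of the alphabet (z wraps to a)."
)

def reference_encode(word: str) -> str:
    """Ground-truth implementation of the injected rule."""
    shifted = [
        chr((ord(c) - ord("a") + 1) % 26 + ord("a"))
        for c in reversed(word)
    ]
    return "".join(shifted)

def build_prompt(rule: str, word: str) -> str:
    """Zero-shot prompt: only the rule and the question, no worked examples."""
    return (
        f"{rule}\n"
        f"Question: Encode the word '{word}'. "
        "Answer with the encoded word only."
    )

def score(model_answer: str, word: str) -> bool:
    """Exact-match scoring against the reference implementation."""
    return model_answer.strip().lower() == reference_encode(word)

# "cab" -> reverse -> "bac" -> shift each letter -> "cbd"
print(reference_encode("cab"))  # cbd
print(score("cbd", "cab"))      # True
```

Because the rule is invented at evaluation time, a correct answer can only come from applying the stated rule, not from memorized domain knowledge; this is the property the KOR paradigm targets.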