🤖 AI Summary
Existing evaluation methods struggle to disentangle domain knowledge from reasoning, leading to biased measurements of pure reasoning capability. To address this, we propose the Knowledge-Orthogonal Reasoning (KOR) paradigm and introduce KOR-Bench—the first benchmark explicitly designed to assess reasoning in isolation from domain-specific knowledge. KOR-Bench comprises five task categories: operations, logic, ciphers, puzzles, and counterfactuals, evaluated via a rule-driven framework enabling dynamic rule injection and cross-task generalization analysis. Methodologically, we integrate Stepwise Prompting, two-round Self-Correction, rule-focused attention visualization, and multi-scale few-shot evaluation. Experiments show that O1-Preview and O1-Mini achieve 72.88% and 70.16% accuracy, respectively—significantly outperforming Claude-3.5-Sonnet (58.96%) and GPT-4o (58.00%). Notably, cipher tasks reveal a critical bottleneck across current models, underscoring the need for improved symbolic and constraint-based reasoning.
📝 Abstract
In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), a concept aimed at minimizing reliance on domain-specific knowledge to enable more accurate evaluation of models' reasoning abilities in out-of-distribution settings. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes models' effectiveness in applying new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, surpassing Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%), highlighting the effectiveness of KOR-Bench. We perform detailed analyses, identifying bottlenecks in the Cipher task with Stepwise Prompting, where two rounds of Self-Correction yield optimal results. We evaluate performance across three integrated tasks, explore the impact of Tricks on the Puzzle task, and visualize rule-focused attention. Additionally, we conduct ablation studies on dataset size, benchmark correlations, and zero-shot and three-shot "only questions" experiments. KOR-Bench aims to enhance reasoning evaluation and support further research in this area.
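To make the rule-driven evaluation idea concrete, here is a minimal sketch (not from the paper; the rule, function names, and prompt wording are invented for illustration) of how a freshly defined rule can be injected into a zero-shot prompt while a reference implementation of the same rule scores the model's answer:

```python
# Hypothetical sketch of knowledge-orthogonal, rule-driven evaluation:
# a newly invented rule is given to the model in the prompt, and a
# ground-truth implementation of that same rule checks the answer.
# No pretraining knowledge is needed to solve the task, only the rule.

RULE = (
    "Rule: To encode a word, reverse it, then replace each letter "
    "with the next letter of the alphabet (z wraps to a)."
)

def reference_encode(word: str) -> str:
    """Ground-truth implementation of the injected rule."""
    shifted = [
        chr((ord(c) - ord("a") + 1) % 26 + ord("a"))
        for c in reversed(word)
    ]
    return "".join(shifted)

def build_prompt(rule: str, word: str) -> str:
    """Zero-shot prompt: only the rule and the question, no worked examples."""
    return (
        f"{rule}\n"
        f"Question: Encode the word '{word}'. "
        "Answer with the encoded word only."
    )

def score(model_answer: str, word: str) -> bool:
    """Exact-match scoring against the reference implementation."""
    return model_answer.strip().lower() == reference_encode(word)

# "cab" -> reverse -> "bac" -> shift each letter -> "cbd"
print(reference_encode("cab"))  # cbd
print(score("cbd", "cab"))      # True
```

Because the rule is invented at evaluation time, a correct answer can only come from applying the stated rule, not from memorized domain knowledge; this is the property the KOR paradigm targets.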