KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing evaluation methods struggle to disentangle domain knowledge from reasoning, leading to biased measurements of pure reasoning capability. To address this, we propose the Knowledge-Orthogonal Reasoning (KOR) paradigm and introduce KOR-Bench—the first benchmark explicitly designed to assess reasoning in isolation from domain-specific knowledge. KOR-Bench comprises five task categories: operations, logic, ciphers, puzzles, and counterfactuals, evaluated via a rule-driven framework enabling dynamic rule injection and cross-task generalization analysis. Methodologically, we integrate Stepwise Prompting, two-round Self-Correction, rule-focused attention visualization, and multi-scale few-shot evaluation. Experiments show that O1-Preview and O1-Mini achieve 72.88% and 70.16% accuracy, respectively—significantly outperforming Claude-3.5-Sonnet (58.96%) and GPT-4o (58.00%). Notably, cipher tasks reveal a critical bottleneck across current models, underscoring the need for improved symbolic and constraint-based reasoning.

📝 Abstract
In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), a concept aimed at minimizing reliance on domain-specific knowledge, enabling more accurate evaluation of models' reasoning abilities in out-of-distribution settings. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes models' effectiveness in applying new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, surpassing Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%), highlighting the effectiveness of KOR-Bench. We perform detailed analyses, identifying bottlenecks in the Cipher task with Stepwise Prompting, where two rounds of Self-Correction yield optimal results. We evaluate performance across three integrated tasks, explore the impact of Tricks on the Puzzle task, and visualize rule-focused attention. Additionally, we conduct an ablation study on dataset size, benchmark correlations, and zero-shot and three-shot "only questions" experiments. KOR-Bench aims to enhance reasoning evaluation and support further research in this area.
Problem

Research questions and friction points this paper is trying to address.

Minimizing reliance on domain-specific knowledge for reasoning evaluation.
Evaluating models' ability to solve novel rule-driven questions.
Identifying bottlenecks and improving reasoning task performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Knowledge-Orthogonal Reasoning (KOR) concept
Proposes KOR-Bench with five task categories
Uses Stepwise Prompting and Self-Correction techniques
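To make the knowledge-orthogonal idea concrete, here is a minimal illustrative sketch (not code from the paper) of how a KOR-style "Operation" item works: the prompt defines a newly invented rule, and the answer can only be reached by applying that rule rather than any prior domain knowledge. The rule text, prompt builder, and exact-match scorer below are hypothetical examples.

```python
# Illustrative sketch (not from the paper): a KOR-style "Operation" item pairs
# a newly invented rule with a question that is only solvable by applying it.
# The rule, prompt format, and scoring function here are all hypothetical.

def gold_answer(a: int, b: int) -> int:
    """Apply the invented rule 'a ⊕ b = a + 2b' that the prompt defines."""
    return a + 2 * b

def build_prompt(a: int, b: int) -> str:
    """Inject the novel rule alongside the question (dynamic rule injection)."""
    rule = "Define a new operation: a \u2295 b = a + 2b."
    question = f"Using only this rule, compute {a} \u2295 {b}."
    return f"{rule}\n{question}"

def score(prediction: str, a: int, b: int) -> bool:
    """Exact-match scoring: the model's final answer must equal the rule's output."""
    return prediction.strip() == str(gold_answer(a, b))

prompt = build_prompt(3, 4)
print(score("11", 3, 4))  # rule applied correctly: 3 + 2*4 = 11
print(score("7", 3, 4))   # plain addition ignores the injected rule
```

Because the rule is invented per item, memorized domain knowledge gives no advantage; only faithful rule application scores, which is the property the benchmark is built around.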
👥 Authors

Kaijing Ma, Fudan University (Computer Vision, Machine Learning)
Xinrun Du, Multimodal Art Projection Research Community, 01.ai (LLM)
Yunran Wang, École Polytechnique
Haoran Zhang, University of Manchester, Multimodal Art Projection Research Community
Zhoufutu Wen, ByteDance SEED (LLM Evaluation)
Xingwei Qu, University of Manchester, Multimodal Art Projection Research Community
Jian Yang, Multimodal Art Projection Research Community
Jiaheng Liu, 2077.AI, Multimodal Art Projection Research Community
Minghao Liu, 2077.AI, Multimodal Art Projection Research Community
Xiang Yue, Carnegie Mellon University (Natural Language Processing, Large Language Models, Machine Learning)
Wenhao Huang, ByteDance.Inc, 01.AI, Multimodal Art Projection Research Community
Ge Zhang, ByteDance.Inc, 01.AI, Multimodal Art Projection Research Community