COMPKE: Complex Question Answering under Knowledge Editing

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing knowledge editing benchmarks inadequately evaluate whether models can apply newly injected knowledge in realistic, complex scenarios, such as questions involving one-to-many relationships or multi-step logical intersections. Method: We introduce COMPKE, a complex question-answering benchmark for knowledge editing comprising 11,924 questions that reflect real-life situations. Using COMPKE, we conduct a systematic cross-model evaluation of mainstream editing methods (e.g., MeLLo, ROME), revealing severe performance disparities: MeLLo's accuracy drops more than tenfold, from 39.47 on GPT-4o-mini to 3.83 on Qwen2.5-3B. Results: We trace the root causes of these disparities from both method-design and model-architecture perspectives, and we publicly release the COMPKE dataset and evaluation framework to advance knowledge editing toward complex reasoning.

📝 Abstract
Knowledge Editing, which efficiently modifies the knowledge in large language models, has garnered significant attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning involving one-to-many relationships or multi-step logical intersections. To fill this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models. For instance, MeLLo attains an accuracy of 39.47 on GPT-4o-mini, but this drops sharply to 3.83 on Qwen2.5-3B. We further investigate the underlying causes of these disparities from both methodological and model-specific perspectives. The datasets are available at https://github.com/kzjkzj666/CompKE.
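Since the abstract centers on questions that combine multi-hop reasoning with one-to-many answer sets, a concrete instance helps make the task shape clear. The following is a hypothetical sketch only: the field names (`edits`, `question`, `gold_answers`) and the strict set-match scoring are assumptions for illustration, not the released schema at https://github.com/kzjkzj666/CompKE.

```python
# Hypothetical COMPKE-style item (field names assumed, not the released schema).
# The key property: the question is answered by a SET of entities that must be
# reached through an injected, counterfactual fact.
example = {
    "edits": ["The CEO of Company X is now Person Y."],  # injected fact
    "question": (
        "Which products are made by companies whose CEO graduated from "
        "the same university as the CEO of Company X?"
    ),
    "gold_answers": {"Product A", "Product B"},  # one-to-many answer set
}

def set_exact_match(predicted: set, gold: set) -> float:
    """One plausible scoring rule (assumed): credit only when the full
    answer set is recovered, which makes one-to-many questions stricter
    than single-answer multi-hop QA."""
    return float(predicted == gold)
```

Under a rule like this, a model that names only Product A scores 0, which is exactly the failure mode that single-answer multi-hop benchmarks cannot expose.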
Problem

Research questions and friction points this paper is trying to address.

Evaluating knowledge editing in real-life complex reasoning scenarios
Assessing model performance on one-to-many and multi-step logical questions
Analyzing effectiveness disparities across different knowledge editing methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces COMPKE benchmark for complex QA
Evaluates four knowledge editing methods (see the evaluation-loop sketch after this list)
Analyzes method-model effectiveness disparities
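As a rough illustration of the cross-method, cross-model comparison above, here is a minimal evaluation-loop sketch. `apply_edits` and `ask` are hypothetical callables standing in for an editing method and a query interface; only MeLLo (and ROME, in the summary) are named on this page, and the released framework's actual API may differ.

```python
from typing import Callable, Iterable

def evaluate(
    apply_edits: Callable,  # hypothetical: returns a model with edits injected
    ask: Callable,          # hypothetical: queries a model, returns an answer set
    model,
    dataset: Iterable[dict],
) -> float:
    """Accuracy of one (editing method, backbone) pair on COMPKE-style items."""
    items = list(dataset)
    correct = 0
    for item in items:
        edited = apply_edits(model, item["edits"])    # inject the new facts
        predicted = ask(edited, item["question"])     # query the edited model
        correct += predicted == item["gold_answers"]  # strict set match (assumed)
    return 100.0 * correct / len(items)
```

Running this loop over each (editor, backbone) pair yields the kind of grid behind the reported gap: MeLLo at 39.47 accuracy on GPT-4o-mini versus 3.83 on Qwen2.5-3B.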
Keyuan Cheng
Provable Responsible AI and Data Analytics (PRADA) Lab, Peking University, South China University of Technology
Zijian Kan
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Zhixian He
Provable Responsible AI and Data Analytics (PRADA) Lab, Sun Yat-sen University
Zhuoran Zhang
Provable Responsible AI and Data Analytics (PRADA) Lab, Peking University
Muhammad Asif Ali
King Abdullah University of Science and Technology
NLP · Deep Learning · Machine Learning
Ke Xu
South China University of Technology
Lijie Hu
Assistant Professor, MBZUAI
Explainable AI · LLM · Differential Privacy
Di Wang
Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology