Eva-KELLM: A New Benchmark for Evaluating Knowledge Editing of LLMs

📅 2023-08-19
🏛️ arXiv.org
📈 Citations: 39
Influential: 2
🤖 AI Summary
This work addresses the limitations of existing knowledge editing approaches, namely their reliance on manually curated factual triplets and their poor fit for raw, unstructured documents. To bridge this gap, the authors introduce Eva-KELLM, a benchmark for document-level knowledge editing on real-world textual inputs. Methodologically, they propose an evaluation framework in which an LLM is first edited using raw documents and then assessed along four dimensions: editing accuracy, retention of unrelated knowledge, reasoning with the altered knowledge, and cross-lingual knowledge transfer. The key contributions are threefold: (1) a document-driven knowledge editing benchmark enabling this four-dimensional evaluation; (2) the integration of reasoning and cross-lingual capability into the knowledge-editing evaluation paradigm; and (3) empirical evidence that current methods fall short on these higher-order capabilities, alongside the release of the corresponding multi-dimensional benchmark dataset.
📝 Abstract
Large language models (LLMs) possess a wealth of knowledge encoded in their parameters. However, this knowledge may become outdated or unsuitable over time. As a result, there has been a growing interest in knowledge editing for LLMs and evaluating its effectiveness. Existing studies primarily focus on knowledge editing using factual triplets, which not only incur high costs for collection but also struggle to express complex facts. Furthermore, these studies are often limited in their evaluation perspectives. In this paper, we propose Eva-KELLM, a new benchmark for evaluating knowledge editing of LLMs. This benchmark includes an evaluation framework and a corresponding dataset. Under our framework, we first ask the LLM to perform knowledge editing using raw documents, which provides a more convenient and universal approach compared to using factual triplets. We then evaluate the updated LLM from multiple perspectives. In addition to assessing the effectiveness of knowledge editing and the retention of unrelated knowledge from conventional studies, we further test the LLM's ability in two aspects: 1) Reasoning with the altered knowledge, aiming for the LLM to genuinely learn the altered knowledge instead of simply memorizing it. 2) Cross-lingual knowledge transfer, where the LLM updated with raw documents in one language should be capable of handling queries from another language. To facilitate further research, we construct and release the corresponding dataset. Using this benchmark, we investigate the effectiveness of several commonly-used knowledge editing methods. Experimental results indicate that the current methods for knowledge editing using raw documents are not effective in yielding satisfactory results, particularly when it comes to reasoning with altered knowledge and cross-lingual knowledge transfer.
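The evaluation flow described in the abstract (edit an LLM with a raw document, then probe it from four perspectives) can be sketched as follows. This is a minimal illustrative mock, not the paper's actual pipeline: `DummyLM`, the probe sets, and the scoring helper are all hypothetical stand-ins.

```python
# Minimal sketch of the Eva-KELLM evaluation flow. All names here
# (DummyLM, edit_with_document, the toy probe sets) are illustrative
# assumptions, not the benchmark's real API.
from dataclasses import dataclass, field


@dataclass
class DummyLM:
    """Stand-in for an LLM whose parametric knowledge can be edited."""
    memory: dict = field(default_factory=dict)

    def edit_with_document(self, document: dict) -> None:
        # The benchmark edits with raw text; here a "document" is
        # simplified to the query -> answer facts it conveys.
        self.memory.update(document)

    def answer(self, query: str) -> str:
        return self.memory.get(query, "<unknown>")


def accuracy(model: DummyLM, probes: list[tuple[str, str]]) -> float:
    """Fraction of (query, expected answer) probes the model gets right."""
    if not probes:
        return 0.0
    return sum(model.answer(q) == a for q, a in probes) / len(probes)


# Toy counterfactual document expressed as facts.
edit_doc = {"capital of X": "Y", "Y is located in": "X"}

model = DummyLM(memory={"capital of X": "Z", "2+2": "4"})
model.edit_with_document(edit_doc)

# The benchmark's four evaluation perspectives:
scores = {
    "edit_success":  accuracy(model, [("capital of X", "Y")]),
    "retention":     accuracy(model, [("2+2", "4")]),             # unrelated knowledge
    "reasoning":     accuracy(model, [("Y is located in", "X")]), # follows from the edit
    "cross_lingual": accuracy(model, [("capitale de X", "Y")]),   # query in another language
}
print(scores)
```

With this toy model the cross-lingual score is 0.0, since the naive edit never propagates to the French-phrased query, which mirrors the paper's finding that cross-lingual transfer is the weak point of current document-based editing methods.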
Problem

Research questions and friction points this paper is trying to address.

Evaluating document-based knowledge editing in LLMs
Creating a benchmark for editing counterfactual knowledge via raw documents
Assessing the challenges of document-based vs. triplet-based knowledge editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses raw documents instead of manually curated triplets
Introduces Extract-then-Edit pipeline
Evaluates edited models from four perspectives