Has this Fact been Edited? Detecting Knowledge Edits in Language Models

📅 2024-05-04
📈 Citations: 2
Influential: 1
📄 PDF
🤖 AI Summary
This paper introduces and formally defines the novel task of *knowledge editing detection*: determining whether a large language model's (LLM's) factual output stems from its original pretraining knowledge or from a subsequent knowledge edit (e.g., via ROME or MEMIT). To address the core challenge, namely the high semantic similarity between edited and unedited knowledge, the authors propose a lightweight AdaBoost classifier that jointly leverages hidden-layer representations and output probability distributions, achieving high detection accuracy with minimal labeled data. The contributions are threefold: (1) a formal problem formulation of *knowledge editing detectability*; (2) an in-depth analysis revealing the fundamental difficulty of distinguishing edited from unedited knowledge; and (3) a robust, generalizable detection baseline validated across diverse editing methods, LLM architectures (Llama-2, GPT-J), and domains, demonstrating strong cross-domain stability.

📝 Abstract
Knowledge editing methods (KEs) can update language models' obsolete or inaccurate knowledge learned from pre-training. However, KEs can be used for malicious applications, e.g., inserting misinformation and toxic content. Knowing whether a generated output is based on edited knowledge or first-hand knowledge from pre-training can increase users' trust in generative models and provide more transparency. Driven by this, we propose a novel task: detecting edited knowledge in language models. Given an edited model and a fact retrieved by a prompt from an edited model, the objective is to classify the knowledge as either unedited (based on the pre-training) or edited (based on subsequent editing). We instantiate the task with four KEs, two LLMs, and two datasets. Additionally, we propose using the hidden state representations and the probability distributions as features for the detection. Our results reveal that using these features as inputs to a simple AdaBoost classifier establishes a strong baseline. This classifier requires only a limited amount of data and maintains its performance even in cross-domain settings. Finally, we find it more challenging to distinguish edited knowledge from unedited but related knowledge, highlighting the need for further research. Our work lays the groundwork for addressing malicious model editing, which is a critical challenge associated with the strong generative capabilities of LLMs.
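The detection pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: synthetic feature vectors stand in for the hidden state representations and output probability distributions that would, in practice, be extracted from the edited model (e.g., Llama-2 or GPT-J), and the feature dimensions and classifier hyperparameters are placeholder assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_facts, hidden_dim, prob_dim = 400, 32, 16  # assumed sizes, for illustration only

# Synthetic stand-ins for the two feature types the paper proposes:
# a hidden-layer representation concatenated with an output probability
# distribution. Unedited facts come from one distribution; edited facts
# get a small mean shift, standing in for the representational footprint
# that editing methods (e.g., ROME, MEMIT) leave in the model.
X_unedited = rng.normal(0.0, 1.0, size=(n_facts, hidden_dim + prob_dim))
X_edited = rng.normal(0.6, 1.0, size=(n_facts, hidden_dim + prob_dim))

X = np.vstack([X_unedited, X_edited])
y = np.array([0] * n_facts + [1] * n_facts)  # 0 = unedited, 1 = edited

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A simple AdaBoost classifier, as in the paper's baseline; the
# n_estimators value here is an assumption, not the authors' setting.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out detection accuracy: {acc:.2f}")
```

On real data, the features would be read out of the edited model's forward pass for the probing prompt; the point of the sketch is only that a lightweight, data-efficient classifier over such features is the whole detection machinery.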
Problem

Research questions and friction points this paper is trying to address.

Detecting edited knowledge in language models.
Classifying knowledge as edited or unedited.
Addressing malicious applications of knowledge editing.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detecting edited knowledge in models
Using hidden state representations
AdaBoost classifier for detection