Reverse-Engineering Model Editing on Language Models

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a privacy vulnerability in large language model editing: the parameter updates applied during localized editing inadvertently leak the sensitive data being edited. To exploit this side channel, the authors propose KSTER, a two-stage inversion attack framework that first leverages the low-rank structure of parameter updates to recover the edited subject via spectral analysis, and then reconstructs the original prompt semantics through entropy minimization. The study is the first to demonstrate that parameter updates in model editing serve as an effective side channel for data leakage, and it introduces a recovery mechanism combining subspace fingerprinting with entropy-based optimization. The authors also design a subspace obfuscation defense that substantially mitigates information leakage while preserving editing utility. Experiments on multiple mainstream large language models show that KSTER recovers edited data with high success rates, while the proposed defense reduces privacy risk without compromising edit performance.
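The subject-recovery stage described above can be illustrated with a minimal numpy sketch. Note the setup is an assumption, not the paper's exact KSTER procedure: it models a ROME-style rank-one edit `dW = u @ k.T`, where `k` is the edited subject's key vector, and shows how the top right-singular vector of the observed update acts as a fingerprint that can be matched against candidate subjects. All names and shapes are illustrative.

```python
import numpy as np

# Hypothetical sketch: recover an edited subject from a rank-one
# parameter update, assuming a ROME-style edit dW = u @ k.T where k is
# the subject's key vector. Dimensions and names are illustrative.
rng = np.random.default_rng(0)
d_out, d_in = 64, 32

# Candidate subject keys the attacker can compute offline.
candidates = rng.normal(size=(10, d_in))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

true_idx = 3
k = candidates[true_idx]                 # edited subject's key
u = rng.normal(size=(d_out, 1))          # value-side update direction
dW = u @ k[None, :]                      # observed low-rank update

# Spectral analysis: the top right-singular vector of dW spans its row
# space, which for a rank-one edit coincides with the key (up to sign).
_, _, Vt = np.linalg.svd(dW, full_matrices=False)
fingerprint = Vt[0]

# Match the fingerprint against candidates by absolute cosine similarity.
scores = np.abs(candidates @ fingerprint)
recovered = int(np.argmax(scores))
print(recovered)  # → 3
```

The true candidate scores exactly 1.0 (it is collinear with the fingerprint), while random unit vectors in 32 dimensions score far lower, so the match is unambiguous.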

📝 Abstract
Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named KSTER (KeySpace ReconsTruction-then-Entropy Reduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAtk.git.
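The subspace camouflage defense in the abstract can also be sketched in a few lines of numpy. This is a hedged illustration under assumptions, not the paper's exact construction: decoy rank-one terms (here orthogonalized against the true key for clarity) are blended into the published update with larger singular values, so the attacker's spectral fingerprint no longer aligns with the edited subject's key.

```python
import numpy as np

# Hypothetical sketch of a subspace-camouflage-style defense: blend
# decoy rank-one terms into the published update so its dominant
# singular direction no longer matches the true key. Scaling and
# orthogonalization choices are illustrative assumptions.
rng = np.random.default_rng(1)
d_out, d_in = 64, 32

k_true = rng.normal(size=d_in)
k_true /= np.linalg.norm(k_true)
u_true = rng.normal(size=(d_out, 1))
dW = u_true @ k_true[None, :]            # honest rank-one edit update

# Baseline: without camouflage, the fingerprint fully exposes the key.
_, _, Vt0 = np.linalg.svd(dW, full_matrices=False)
baseline = abs(Vt0[0] @ k_true)          # ≈ 1.0

# Inject decoy directions with larger singular values than the edit.
decoys = rng.normal(size=(3, d_in))
camouflaged = dW.copy()
scale = 2.0 * np.linalg.norm(u_true)
for k_decoy in decoys:
    k_decoy = k_decoy - (k_decoy @ k_true) * k_true  # decorrelate from true key
    k_decoy /= np.linalg.norm(k_decoy)
    u_decoy = rng.normal(size=(d_out, 1))
    u_decoy /= np.linalg.norm(u_decoy)
    camouflaged += scale * (u_decoy @ k_decoy[None, :])

# The attacker's fingerprint now lies in the decoy subspace.
_, _, Vt = np.linalg.svd(camouflaged, full_matrices=False)
leak = abs(Vt[0] @ k_true)
print(round(leak, 3))
```

Since the decoy terms dominate the spectrum, the top right-singular vector falls (up to a small perturbation) inside the decoy subspace, and its overlap with the true key drops from roughly 1.0 to near zero.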
Problem

Research questions and friction points this paper is trying to address.

model editing
side channel
parameter update
data recovery
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

model editing
reverse-engineering attack
low-rank update
spectral analysis
subspace camouflage
Zhiyu Sun
Shanghai Qi Zhi Institute; Software Engineering Institute, East China Normal University, Shanghai, China
Minrui Luo
Institute for Interdisciplinary Information Sciences, Tsinghua University; Shanghai Qi Zhi Institute
Yu Wang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Zhili Chen
East China Normal University
Differential Privacy; Secure Multiparty Computation; Federated Learning
Tianxing He
Tsinghua University
NLP