Reverse-Engineering Model Editing on Language Models

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a privacy vulnerability in large language model editing: the parameter updates applied during localized editing inadvertently leak the sensitive data being edited. To exploit this side channel, the authors propose KSTER, a two-stage inversion attack framework that first leverages the low-rank structure of parameter updates to recover the edited subject via spectral analysis, and then reconstructs the original prompt semantics through entropy minimization. The study is the first to demonstrate that parameter updates in model editing serve as an effective side channel for data leakage, and it introduces a recovery mechanism combining subspace fingerprinting with entropy-based optimization. The authors also design a subspace obfuscation defense that substantially mitigates information leakage while preserving editing utility. Experiments on multiple mainstream large language models show that KSTER recovers edited data with high success rates, while the proposed defense reduces privacy risk without compromising edit performance.
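The subject-recovery stage described above can be illustrated with a minimal numpy sketch. Note the setup is an assumption, not the paper's exact KSTER procedure: it models a ROME-style rank-one edit `dW = u @ k.T`, where `k` is the edited subject's key vector, and shows how the top right-singular vector of the observed update acts as a fingerprint that can be matched against candidate subjects. All names and shapes are illustrative.

```python
import numpy as np

# Hypothetical sketch: recover an edited subject from a rank-one
# parameter update, assuming a ROME-style edit dW = u @ k.T where k is
# the subject's key vector. Dimensions and names are illustrative.
rng = np.random.default_rng(0)
d_out, d_in = 64, 32

# Candidate subject keys the attacker can compute offline.
candidates = rng.normal(size=(10, d_in))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

true_idx = 3
k = candidates[true_idx]                 # edited subject's key
u = rng.normal(size=(d_out, 1))          # value-side update direction
dW = u @ k[None, :]                      # observed low-rank update

# Spectral analysis: the top right-singular vector of dW spans its row
# space, which for a rank-one edit coincides with the key (up to sign).
_, _, Vt = np.linalg.svd(dW, full_matrices=False)
fingerprint = Vt[0]

# Match the fingerprint against candidates by absolute cosine similarity.
scores = np.abs(candidates @ fingerprint)
recovered = int(np.argmax(scores))
print(recovered)  # → 3
```

The true candidate scores exactly 1.0 (it is collinear with the fingerprint), while random unit vectors in 32 dimensions score far lower, so the match is unambiguous.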

📝 Abstract
Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named KSTER (KeySpace ReconsTruction-then-Entropy Reduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAtk.git.
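The subspace camouflage defense in the abstract can also be sketched in a few lines of numpy. This is a hedged illustration under assumptions, not the paper's exact construction: decoy rank-one terms (here orthogonalized against the true key for clarity) are blended into the published update with larger singular values, so the attacker's spectral fingerprint no longer aligns with the edited subject's key.

```python
import numpy as np

# Hypothetical sketch of a subspace-camouflage-style defense: blend
# decoy rank-one terms into the published update so its dominant
# singular direction no longer matches the true key. Scaling and
# orthogonalization choices are illustrative assumptions.
rng = np.random.default_rng(1)
d_out, d_in = 64, 32

k_true = rng.normal(size=d_in)
k_true /= np.linalg.norm(k_true)
u_true = rng.normal(size=(d_out, 1))
dW = u_true @ k_true[None, :]            # honest rank-one edit update

# Baseline: without camouflage, the fingerprint fully exposes the key.
_, _, Vt0 = np.linalg.svd(dW, full_matrices=False)
baseline = abs(Vt0[0] @ k_true)          # ≈ 1.0

# Inject decoy directions with larger singular values than the edit.
decoys = rng.normal(size=(3, d_in))
camouflaged = dW.copy()
scale = 2.0 * np.linalg.norm(u_true)
for k_decoy in decoys:
    k_decoy = k_decoy - (k_decoy @ k_true) * k_true  # decorrelate from true key
    k_decoy /= np.linalg.norm(k_decoy)
    u_decoy = rng.normal(size=(d_out, 1))
    u_decoy /= np.linalg.norm(u_decoy)
    camouflaged += scale * (u_decoy @ k_decoy[None, :])

# The attacker's fingerprint now lies in the decoy subspace.
_, _, Vt = np.linalg.svd(camouflaged, full_matrices=False)
leak = abs(Vt[0] @ k_true)
print(round(leak, 3))
```

Since the decoy terms dominate the spectrum, the top right-singular vector falls (up to a small perturbation) inside the decoy subspace, and its overlap with the true key drops from roughly 1.0 to near zero.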
Problem

Research questions and friction points this paper is trying to address.

model editing
side channel
parameter update
data recovery
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

model editing
reverse-engineering attack
low-rank update
spectral analysis
subspace camouflage
Zhiyu Sun
Shanghai Qi Zhi Institute; Software Engineering Institute, East China Normal University, Shanghai, China
Minrui Luo
Institute for Interdisciplinary Information Sciences, Tsinghua University; Shanghai Qi Zhi Institute
Yu Wang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Zhili Chen
East China Normal University
Differential Privacy; Secure Multiparty Computation; Federated Learning
Tianxing He
Tsinghua University
NLP