🤖 AI Summary
This work addresses the challenge of knowledge editing in large language models by identifying an optimal editing layer that enables precise modification of specific knowledge while minimizing interference with other model behaviors. The study provides the first empirical validation of a generalizable "golden layer" that approximates instance-level optimal editing performance. To locate this layer efficiently, without extensive trial-and-error over many editing runs, the authors propose Layer-wise Gradient Attribution (LGA), a gradient-based attribution method. LGA integrates proxy-dataset evaluation with cross-dataset generalization strategies and is compatible with multiple mainstream editing algorithms. Experiments across diverse benchmarks demonstrate that LGA significantly improves both editing efficiency and success rates, and that it generalizes effectively across different model architectures.
📄 Abstract
Knowledge editing in Large Language Models (LLMs) aims to update the model's prediction for a specific query to a desired target while preserving its behavior on all other inputs. This process typically involves two stages: identifying the layer to edit and performing the parameter update. Intuitively, different queries may localize knowledge at different depths of the model, so a fixed editing layer yields different sample-wise editing performance. In this work, we hypothesize the existence of fixed golden layers that achieve near-optimal editing performance, comparable to sample-wise optimal layers. To validate this hypothesis, we provide empirical evidence by comparing golden layers against ground-truth sample-wise optimal layers. Furthermore, we show that golden layers can be reliably identified using a proxy dataset and generalize effectively to unseen test queries across datasets. Finally, we propose a novel method, Layer Gradient Analysis (LGA), which estimates golden layers efficiently via gradient attribution, avoiding extensive trial-and-error across multiple editing runs. Extensive experiments on several benchmark datasets demonstrate the effectiveness and robustness of LGA across different LLM types and various knowledge editing methods.
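The abstract describes LGA as a gradient-attribution method for selecting the editing layer without running full edits at every depth. A minimal sketch of that core idea, on a toy two-layer linear model rather than an LLM: compute the gradient of an edit loss with respect to each layer's weights and score layers by gradient magnitude, taking the top-scoring layer as the candidate "golden layer". All names here (`layer_gradient_scores`, the toy weights) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def layer_gradient_scores(x, t, W1, W2):
    """Score each layer of a toy 2-layer linear model by the Frobenius
    norm of the gradient of an edit loss L = 0.5 * ||W2 @ W1 @ x - t||^2
    with respect to that layer's weights (illustrative sketch only)."""
    h = W1 @ x                        # layer-1 activation
    y = W2 @ h                        # model output for the edit query
    e = y - t                         # error against the desired target
    grad_W2 = np.outer(e, h)          # dL/dW2
    grad_W1 = np.outer(W2.T @ e, x)   # dL/dW1 (error backpropagated through W2)
    return [np.linalg.norm(grad_W1), np.linalg.norm(grad_W2)]

# Toy edit query: steer the output for x toward target t.
x  = np.array([1.0, 0.0])
t  = np.array([5.0, 0.0])
W1 = np.eye(2)
W2 = 2.0 * np.eye(2)

scores = layer_gradient_scores(x, t, W1, W2)
golden = int(np.argmax(scores))       # layer with the largest gradient attribution
```

In this toy setup the scores are `[6.0, 3.0]`, so layer 0 would be selected. A real instantiation would backpropagate the editing objective through an LLM once and aggregate per-layer gradient norms, which is the single-pass efficiency the abstract contrasts with repeated trial edits.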