The Anatomy of an Edit: Mechanism-Guided Activation Steering for Knowledge Editing

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
When large language models serve as knowledge bases, their internal mechanisms for knowledge editing remain poorly understood. This work employs Neuron-Level Knowledge Attribution (NLKA) to analyze the internal differences between successful and failed edits, revealing how attention and feed-forward networks jointly facilitate the injection of new knowledge and suppression of outdated information. For the first time, attribution results are translated into actionable engineering signals, leading to MEGA—a lightweight, architecture-agnostic activation intervention method that requires no weight modifications. MEGA explicitly identifies where and how edits take effect within the model and demonstrates superior performance over existing approaches on both GPT2-XL and LLaMA2-7B across the CounterFact and Popular datasets, enabling efficient and reliable knowledge updates.

📝 Abstract
Large language models (LLMs) are increasingly used as knowledge bases, but keeping them up to date requires targeted knowledge editing (KE). However, it remains unclear how edits are implemented inside the model once applied. In this work, we take a mechanistic view of KE using neuron-level knowledge attribution (NLKA). Unlike prior work that focuses on pre-edit causal tracing and localization, we use post-edit attribution -- contrasting successful and failed edits -- to isolate the computations that shift when an edit succeeds. Across representative KE methods, we find a consistent pattern: mid-to-late attention predominantly promotes the new target, while attention and FFN modules cooperate to suppress the original fact. Motivated by these findings, we propose MEGA, a MEchanism-Guided Activation steering method that performs attention-residual interventions in attribution-aligned regions without modifying model weights. On CounterFact and Popular, MEGA achieves strong editing performance across KE metrics on GPT2-XL and LLaMA2-7B. Overall, our results elevate post-edit attribution from analysis to engineering signal: by pinpointing where and how edits take hold, it powers MEGA to deliver reliable, architecture-agnostic knowledge edits.
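To make the core idea concrete, here is a minimal sketch of what "activation steering without modifying model weights" can look like: a steering direction is added to the residual stream at a few chosen layers, while all other layers pass through unchanged. This is an illustrative toy, not the paper's implementation; the function names, the toy residual stream, and the choice of layers are all hypothetical stand-ins for MEGA's attribution-aligned attention-residual interventions.

```python
# Toy sketch of activation steering (not the MEGA implementation).
# All names here are illustrative assumptions, not from the paper.

def add_vec(a, b, scale=1.0):
    """Element-wise a + scale * b for plain Python vectors."""
    return [x + scale * y for x, y in zip(a, b)]

def steer_attention_residual(residuals, direction, layers, alpha):
    """Add a steering direction to the (toy) attention-residual output
    at the chosen layers -- the weights themselves are never touched."""
    steered = []
    for i, h in enumerate(residuals):
        if i in layers:  # attribution-aligned layers get the intervention
            steered.append(add_vec(h, direction, scale=alpha))
        else:            # all other layers pass through unchanged
            steered.append(list(h))
    return steered

# Toy example: 4 "layers", 3-dim hidden states, steering at layers 2 and 3.
residuals = [[0.0, 0.0, 0.0] for _ in range(4)]
direction = [1.0, -1.0, 0.5]  # e.g. a new-target-minus-old-fact direction
out = steer_attention_residual(residuals, direction, layers={2, 3}, alpha=2.0)
```

In a real model, the analogous step would typically be a forward hook on specific attention blocks; the key property illustrated here is that the edit lives entirely in activations, so it is lightweight and architecture-agnostic.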
Problem

Research questions and friction points this paper is trying to address.

knowledge editing
large language models
mechanistic interpretability
neuron-level attribution
model updating
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge editing
mechanistic interpretability
neuron-level attribution
activation steering
MEGA