Does Editing Provide Evidence for Localization?

📅 2025-02-17
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work challenges the implicit assumption in interpretability research that successful local representation editing implies semantic behavior is locally encoded. We propose a gradient-driven optimal local editing framework tailored for LLM alignment, systematically evaluating the relationship between edited locations and semantic localization across multiple tasks. Our results show that optimal local edits applied to randomly selected neurons or layers achieve behavioral alignment comparable to full-model fine-tuning; moreover, strong editing responses frequently coincide with actual semantic misalignment. This study provides the first systematic evidence that local editing does not constitute sufficient proof of semantic localization—thereby questioning the foundational validity of current editing-based interpretability paradigms—and establishes stricter validation criteria for causal attribution methods.

📝 Abstract
A basic aspiration for interpretability research in large language models is to "localize" semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is: how strong is the evidence provided by such edits? To assess localization, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at random localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior.
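The "optimal localized edit" idea can be illustrated with a toy sketch (not the paper's implementation; all names, dimensions, and the linear stand-in model are illustrative assumptions): gradient descent on an additive edit that is masked to a chosen location, driven by an alignment-style objective on probe inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 8, 32
W = rng.normal(size=(d, d))         # toy stand-in for one model weight matrix
X = rng.normal(size=(d, n))         # probe inputs
Y_target = rng.normal(size=(d, n))  # desired "aligned" behavior on the probes

def loss(W_edited):
    # alignment objective: squared error to the target behavior on the probes
    return 0.5 * np.sum((W_edited @ X - Y_target) ** 2) / n

def optimal_local_edit(mask, steps=800, lr=0.05):
    """Gradient descent on an additive edit confined to `mask` entries."""
    delta = np.zeros_like(W)
    for _ in range(steps):
        grad = ((W + delta) @ X - Y_target) @ X.T / n  # exact dL/d(delta)
        delta -= lr * grad * mask                      # restrict to the location
    return delta

# a "localization": edit only a random 25% of the weights
mask = (rng.random(W.shape) < 0.25).astype(float)
delta = optimal_local_edit(mask)
print(loss(W), loss(W + delta))  # the edited loss is lower
```

The mask plays the role of the candidate localization; the same routine with `mask = np.ones_like(W)` corresponds to aligning the full (toy) model, which is what the paper compares localized edits against.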
Problem

Research questions and friction points this paper is trying to address.

Assess how much evidence successful edits provide for localization
Find the optimal localized edit using LLM alignment techniques
Evaluate the effectiveness of optimal edits at random locations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts LLM alignment techniques to construct optimal localized edits
Assesses localization against the optimal edit rather than a heuristic one
Tests edits at random localizations as a baseline for edit-based evidence