CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing

📅 2026-03-11

🤖 AI Summary

Editing large language models often triggers unpredictable ripple effects: unintended behavioral changes that spread beyond the targeted edit. To address this, the authors propose CLaRE, a lightweight, gradient-free method that quantifies entanglement among factual representations using forward activations from a single intermediate layer. This makes it feasible to build large-scale factual entanglement graphs, which in turn support efficient construction of edit-preservation sets, audit trails, and red-teaming evaluations. Compared with baselines, CLaRE improves Spearman correlation with observed ripple effects by 62.2% on average, runs 2.74× faster, uses 2.85× less peak GPU memory, and needs only a fraction of the baselines' storage.
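To make the idea of a "factual entanglement graph" concrete, here is a minimal, stdlib-only sketch. It rests on a loud assumption: entanglement between two facts is scored as the cosine similarity of their single-layer activation vectors. CLaRE's actual score is defined in the paper and may differ. The toy vectors and the names `entanglement_graph` and `preservation_set` are hypothetical. In practice, the vectors would come from one forward pass per fact prompt, for example one entry of `hidden_states` when a Hugging Face model is called with `output_hidden_states=True`.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy "activations": in practice each vector would be the hidden
# state of one fact prompt taken from a single intermediate layer.
ACTS: Dict[str, List[float]] = {
    "paris_capital_france": [1.0, 0.0, 0.2],
    "eiffel_in_paris": [0.9, 0.1, 0.3],
    "louvre_in_paris": [0.8, 0.0, 0.4],
    "berlin_capital_germany": [0.1, 1.0, 0.0],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def entanglement_graph(
    acts: Dict[str, List[float]], k: int = 2
) -> Dict[str, List[Tuple[str, float]]]:
    """Link each fact to its k most similar other facts (assumed score: cosine)."""
    graph: Dict[str, List[Tuple[str, float]]] = {}
    for fid, u in acts.items():
        scored = [(other, cosine(u, v)) for other, v in acts.items() if other != fid]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        graph[fid] = scored[:k]
    return graph


def preservation_set(graph: Dict[str, List[Tuple[str, float]]], edited: str) -> List[str]:
    """Facts most entangled with the edited one: candidates to protect or re-test."""
    return [fid for fid, _ in graph[edited]]


if __name__ == "__main__":
    g = entanglement_graph(ACTS, k=2)
    print(preservation_set(g, "paris_capital_france"))
    # → ['eiffel_in_paris', 'louvre_in_paris']
```

Only forward activations are needed, so building the graph requires no backward pass. The all-pairs loop is quadratic. For a corpus of thousands of facts, a matrix product or an approximate nearest-neighbor index would be the natural replacement.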

📝 Abstract

The static knowledge representations of large language models (LLMs) inevitably become outdated or incorrect over time. Model-editing techniques offer a promising solution by modifying a model's factual associations, but they often produce unpredictable ripple effects: unintended behavioral changes that propagate even to the hidden space. In this work, we introduce CLaRE, a lightweight representation-level technique for identifying where these ripple effects may occur. Unlike prior gradient-based methods, CLaRE quantifies entanglement between facts using forward activations from a single intermediate layer, avoiding costly backward passes. To enable systematic study, we prepare and analyse a corpus of 11,427 facts drawn from three existing datasets. Using CLaRE, we compute large-scale entanglement graphs of this corpus for multiple models, capturing how local edits propagate through representational space. These graphs enable stronger preservation sets for model editing, audit trails, efficient red-teaming, and scalable post-edit evaluation. Compared with baselines, CLaRE achieves an average 62.2% improvement in Spearman correlation with ripple effects while being 2.74× faster and using 2.85× less peak GPU memory. In addition, CLaRE requires only a fraction of the storage the baselines need to compute and preserve fact representations. Our entanglement graphs and corpus are available at https://anonymous.4open.science/r/CLaRE-488E.
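The headline metric is a Spearman rank correlation between predicted entanglement and measured ripple effects. The sketch below implements Spearman's rho with only the standard library: rank both lists, giving ties their average rank, then take the Pearson correlation of the ranks. How the paper measures ripple effects and pairs them with scores is not described here. The input numbers are made up for illustration and are not results from the paper.

```python
import math
from typing import List


def average_ranks(xs: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # convert 0-based positions to a 1-based rank
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(va * vb)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up numbers for illustration, not results from the paper.
    entanglement = [0.9, 0.7, 0.4, 0.1]  # predicted score per neighboring fact
    ripple = [0.8, 0.6, 0.5, 0.05]       # measured post-edit change per fact
    print(spearman(entanglement, ripple))  # → 1.0
```

Spearman only rewards getting the ordering right, which fits the use case: a preservation set only needs the most-affected facts ranked first, not calibrated magnitudes. `scipy.stats.spearmanr` computes the same statistic, also with average ranks for ties, if SciPy is available.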

Problem

Research questions and friction points this paper is trying to address.

large language models

model editing

ripple effects

representational entanglement

knowledge updating

Innovation

Methods, ideas, or system contributions that make the work stand out.

representational entanglement

model editing

ripple effects