🤖 AI Summary
Large language models are prone to verbatim memorization of training data, which compromises generalization and raises privacy concerns. Existing global intervention methods often inadvertently impair legitimate generalization. This work proposes Gated Subspace Steering (GSS), the first approach to frame memorization mitigation as a gated steering problem: a detector identifies memorization-related activations, and sparse, context-aware corrections are applied only when necessary. By combining optimal subspace steering with a dynamic gating mechanism, the method also sheds light on the geometric structure of memorized information in neural representations. Evaluated on four benchmarks, GSS matches or exceeds state-of-the-art performance while reducing computational overhead by 100–1000× compared to optimization-based methods.
📝 Abstract
Large language models (LLMs) can memorize and reproduce training sequences verbatim -- a tendency that undermines both generalization and privacy. Existing mitigation methods apply interventions uniformly, degrading performance on the majority of tokens that generalize normally. We show empirically that memorization is sparse, intermittent, and token-conditioned, suggesting that effective mitigation requires context-aware intervention rather than static parameter modification. To this end, we propose Gated Subspace Steering (GSS), a selective memorization mitigation method that decomposes intervention into a probe (detecting memorization-relevant activations) and a steer (applying a targeted correction only when the probe signal exceeds a threshold). The optimal probe-steer pair emerges from a principled optimization framework based on optimal subspace steering. Experiments on four benchmarks show that GSS matches or exceeds state-of-the-art memorization reduction while requiring $100$--$1000\times$ less compute than optimization-based alternatives. Furthermore, we provide new theoretical insights into the geometry of memorization in neural representations.
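The probe-and-steer decomposition above can be illustrated with a minimal sketch. This is not the paper's implementation: the subspace `U`, the threshold `tau`, and the projection-norm probe are placeholder assumptions standing in for the probe-steer pair that GSS obtains from its optimization framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memorization subspace: k orthonormal directions in a
# d-dimensional activation space (here just QR on random vectors; in GSS
# the subspace would come from the optimization framework).
d, k = 64, 4
U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # d x k, orthonormal columns

def gated_steer(h, U, tau=1.0):
    """Probe-then-steer on a single activation vector h.

    Probe: norm of h's projection onto the subspace spanned by U.
    Steer: subtract that projection only when the probe exceeds tau.
    """
    coeffs = U.T @ h                 # coordinates of h in the subspace
    score = np.linalg.norm(coeffs)   # probe signal
    if score > tau:
        return h - U @ coeffs, score # remove the in-subspace component
    return h, score                  # leave generalizing tokens untouched

# An activation with a large component inside the subspace is corrected...
h_mem = U[:, 0] * 5.0 + 0.1 * rng.standard_normal(d)
h_out, s_mem = gated_steer(h_mem, U)

# ...while one mostly outside the subspace passes through unchanged.
h_gen = 0.05 * rng.standard_normal(d)
h_same, s_gen = gated_steer(h_gen, U)
```

The gate is what makes the intervention selective: most activations fall below the threshold and are returned untouched, which matches the abstract's observation that memorization is sparse and token-conditioned.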