SAEs $ extit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

📅 2025-04-11

📈 Citations: 0

✨ Influential: 0

career value

166K/year

🤖 AI Summary

Addressing key challenges in large language model (LLM) knowledge forgetting—including high computational overhead, hyperparameter sensitivity, poor sequential forgetting, vulnerability to relearning attacks, low data efficiency, and weak interpretability—this paper proposes the Dynamic Sparse Autoencoder Guardian (DSG). DSG theoretically establishes and empirically validates that sparse autoencoders (SAEs) significantly enhance forgetting performance. It innovatively introduces dynamic feature gating and zero-shot adaptation to overcome the representational limitations of static SAEs. By integrating principled activation-feature selection with a dynamic classifier, DSG enables precise intervention in the activation space. Across multiple benchmarks, DSG reduces computational cost by 62% compared to mainstream gradient-based methods, decreases sequential forgetting error by 47%, achieves <3% success rate against relearning attacks, supports zero-shot forgetting, and preserves 98.5% of original task performance.

Technology Category

Application Category

📝 Abstract

Machine unlearning is a promising approach to improve LLM safety by removing unwanted knowledge from the model. However, prevailing gradient-based unlearning methods suffer from issues such as high computational costs, hyperparameter instability, poor sequential unlearning capability, vulnerability to relearning attacks, low data efficiency, and lack of interpretability. While Sparse Autoencoders are well-suited to improve these aspects by enabling targeted activation-based unlearning, prior approaches underperform gradient-based methods. This work demonstrates that, contrary to these earlier findings, SAEs can significantly improve unlearning when employed dynamically. We introduce $ extbf{Dynamic DAE Guardrails}$ (DSG), a novel method for precision unlearning that leverages principled feature selection and a dynamic classifier. Our experiments show DSG substantially outperforms leading unlearning methods, achieving superior forget-utility trade-offs. DSG addresses key drawbacks of gradient-based approaches for unlearning -- offering enhanced computational efficiency and stability, robust performance in sequential unlearning, stronger resistance to relearning attacks, better data efficiency including zero-shot settings, and more interpretable unlearning.

Problem

Research questions and friction points this paper is trying to address.

Improves LLM safety by removing unwanted knowledge

Addresses high computational costs in unlearning methods

Enhances interpretability and efficiency in unlearning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Sparse Autoencoder for precision unlearning

Principled feature selection with dynamic classifier

Enhanced efficiency, stability, and interpretability

🔎 Similar Papers

On Effects of Steering Latent Representation for Large Language Model Unlearning