GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

📅 2025-12-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the vulnerability of large language models (LLMs) to adversarial prompts and jailbreak attacks that elicit harmful outputs, this paper proposes a safety-steering method based on distributed representations. Departing from the conventional single-dimension safety assumption, the authors design a graph-regularized sparse autoencoder (SAE) that adds a Laplacian smoothness penalty derived from the neuron co-activation graph, so that abstract safety concepts such as "refusal" are encoded as continuous, cross-feature patterns. A two-stage dynamic gating mechanism activates intervention only when risk is detected, balancing safety and usability. The method is architecture-agnostic, compatible with LLaMA-3, Mistral, Qwen, Phi, and other model families. On standard benchmarks it achieves an average selective refusal rate of 82%, a 40-percentage-point improvement over vanilla SAE steering, while maintaining ≥90% refusal rates against GCG and AutoDAN attacks. Crucially, it preserves task performance: 70% accuracy on TriviaQA, 65% on TruthfulQA, and 74% on GSM8K.
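The graph-regularized objective described above can be sketched as a standard SAE loss plus a Laplacian smoothness term over the latent codes. This is a minimal illustrative form, not the paper's exact formulation: the trace-form penalty tr(Z L Zᵀ), the L1 sparsity term, and the weighting coefficients are assumptions.

```python
import numpy as np

def gsae_loss(x, x_hat, z, L, lam_sparse=1e-3, lam_graph=1e-2):
    """Illustrative GSAE objective (hypothetical form):
    reconstruction + L1 sparsity + Laplacian smoothness tr(Z L Z^T)
    over the neuron co-activation graph with Laplacian L.
    z: latent codes, shape (batch, features); L: (features, features)."""
    recon = np.mean((x - x_hat) ** 2)           # reconstruction error
    sparse = lam_sparse * np.abs(z).sum()       # L1 sparsity on latent codes
    # Laplacian smoothness: penalizes codes that differ across
    # features connected in the co-activation graph
    smooth = lam_graph * np.trace(z @ L @ z.T)
    return recon + sparse + smooth
```

With the combinatorial Laplacian L = D − A, the smoothness term is zero exactly when connected features carry identical codes, which is what lets a concept like "refusal" spread coherently over several features instead of collapsing into one.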

📝 Abstract
Large language models (LLMs) face critical safety challenges, as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. Defenses are typically either black-box guardrails that filter outputs, or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While effective for simple concepts, this assumption is limiting, as recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extend standard SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. Unlike standard SAEs that assign each concept to a single latent feature, GSAEs recover smooth, distributed safety representations as coherent patterns spanning multiple features. We empirically demonstrate that GSAE enables effective runtime safety steering, assembling features into a weighted set of safety-relevant directions and controlling them with a two-stage gating mechanism that activates interventions only when harmful prompts or continuations are detected during generation. This approach enforces refusals adaptively while preserving utility on benign queries. Across safety and QA benchmarks, GSAE steering achieves an average 82% selective refusal rate, substantially outperforming standard SAE steering (42%), while maintaining strong task accuracy (70% on TriviaQA, 65% on TruthfulQA, 74% on GSM8K). Robustness experiments further show generalization across LLaMA-3, Mistral, Qwen, and Phi families and resilience against jailbreak attacks (GCG, AutoDAN), consistently maintaining ≥90% refusal of harmful content.
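The runtime intervention the abstract describes, a weighted set of safety directions applied through a two-stage gate, could look roughly like the following. This is a minimal sketch; the function name, the risk threshold `tau`, and the steering strength `alpha` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gated_steer(h, risk_score, safety_dirs, weights, tau=0.5, alpha=4.0):
    """Hypothetical two-stage gate: stage 1 produces risk_score for the
    prompt or partial continuation; stage 2 adds a weighted combination
    of safety-relevant directions to the hidden state h only when the
    score exceeds the threshold tau."""
    if risk_score < tau:
        return h                       # benign input: leave activations untouched
    steer = sum(w * d for w, d in zip(weights, safety_dirs))
    return h + alpha * steer           # push activations toward refusal
```

Gating the intervention on a detector, rather than steering every token, is what lets the method keep utility on benign queries while still enforcing refusals on harmful ones.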
Problem

Research questions and friction points this paper is trying to address.

Addresses LLM safety against adversarial prompts and jailbreak attacks
Overcomes limitation of single-feature safety representation in existing methods
Enables adaptive safety steering while preserving utility on benign queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-regularized sparse autoencoders (GSAEs) with a Laplacian smoothness penalty on the neuron co-activation graph
Two-stage gating mechanism that triggers safety intervention only when risk is detected
Recovery of distributed safety representations as patterns spanning multiple features
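The neuron co-activation graph that underlies the Laplacian penalty could be built along these lines. A minimal sketch under stated assumptions: the activation threshold and the count-based edge weights are illustrative choices, as the paper's summary does not specify how edges are weighted.

```python
import numpy as np

def coactivation_laplacian(Z, thresh=0.0):
    """Sketch: build a neuron co-activation graph from latent codes
    Z (batch x features) and return its combinatorial Laplacian
    L = D - A. Edges count how often two features fire together."""
    B = (Z > thresh).astype(float)   # which features fire per sample
    A = B.T @ B                      # pairwise co-activation counts
    np.fill_diagonal(A, 0.0)         # drop self-loops
    D = np.diag(A.sum(axis=1))       # degree matrix
    return D - A
```

By construction L is symmetric with zero row sums, the two properties the smoothness penalty tr(Z L Zᵀ) relies on.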