🤖 AI Summary
Despite safety alignment, large language models remain vulnerable to jailbreak attacks that elicit harmful outputs, yet the underlying mechanisms are poorly understood. This work proposes a three-stage analytical framework applied to the Gemma-2-2B model: first extracting concept-aligned tokens from adversarial responses via subspace similarity, then identifying critical feature subsets within sparse autoencoders (SAEs) at each layer, and finally amplifying these features to assess their causal influence on harmful generation. Features are grouped with cluster-based, hierarchical-linkage, and single-token-driven strategies, and the effect of steering is scored with a standardized LLM-judge protocol. The study finds that jailbreak vulnerabilities are concentrated in specific feature subsets within the model’s middle-to-late layers (layers 16–25). This finding points to the layer-localized nature of such vulnerabilities and suggests feature-level intervention as a direction for adversarial defense.
📝 Abstract
Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks demonstrate this vulnerability but do not reveal the internal mechanisms behind it. This study asks whether jailbreak success is driven by identifiable internal features rather than by prompts alone. We propose a three-stage pipeline for Gemma-2-2B using the BeaverTails dataset. First, we extract concept-aligned tokens from adversarial responses via subspace similarity. Second, we apply three feature-grouping strategies (cluster, hierarchical-linkage, and single-token-driven) to identify SAE feature subgroups for the aligned tokens across all 26 model layers. Third, we steer the model by amplifying the top features from each identified subgroup and measure the change in harmfulness score using a standardized LLM-judge scoring protocol. Under all three grouping strategies, features in layers 16–25 were relatively more vulnerable to steering, indicating that feature subgroups in the mid-to-late layers are more responsible for unsafe outputs. These results provide evidence that the jailbreak vulnerability in Gemma-2-2B is localized to feature subgroups in the mid-to-late layers, suggesting that targeted feature-level interventions may offer a more principled path to adversarial robustness than current prompt-level defenses.
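The third stage (amplifying the top SAE features and observing the effect) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a toy ReLU sparse autoencoder with hypothetical weights `W_enc`, `b_enc`, and `W_dec`, and a hypothetical `scale` factor, none of which are specified in the abstract. Only the change in the selected features' decoded contribution is added back to the residual vector, so everything the SAE does not capture is left untouched.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc):
    """Toy ReLU SAE encoder (an assumption): feature activations for residual x."""
    return np.maximum(0.0, x @ W_enc + b_enc)


def top_k_features(acts, k):
    """Indices of the k most active features, strongest first."""
    return np.argsort(acts)[::-1][:k]


def steer_residual(x, W_enc, b_enc, W_dec, feature_ids, scale):
    """Amplify the chosen SAE features by `scale` and write the change back to x.

    Shapes: x is (d_model,), W_enc is (d_model, n_features),
    b_enc is (n_features,), W_dec is (n_features, d_model).
    """
    acts = sae_encode(x, W_enc, b_enc)
    delta = np.zeros_like(acts)
    # Extra activation needed so each selected feature ends up at scale * its value.
    delta[feature_ids] = (scale - 1.0) * acts[feature_ids]
    # Map the activation change through the decoder into residual-stream space.
    return x + delta @ W_dec
```

In a full experiment, the steered vector would replace the layer's residual activation during generation, and the change in the judge-assigned harmfulness score would be compared across layers. With `scale=1.0` the function returns `x` unchanged, which makes a convenient sanity check.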