🤖 AI Summary
This work exposes a critical robustness deficiency in large language models (LLMs): existing alignment methods merely suppress surface-level refusal behaviors without restructuring internal representations, leaving models vulnerable to producing harmful outputs under minute perturbations in latent space. To systematically uncover this alignment fragility, we propose the Activation Steering Attack (ASA), a novel adversarial probing framework. Building on this, we introduce Layer-wise Adversarial Patch Training (LAPT), the first method to inject robustness constraints directly into hidden states during training. We validate LAPT using negative log-likelihood probing, latent-space sensitivity analysis, and adversarial hidden-state perturbation. Results demonstrate that LAPT significantly enhances model resilience against latent-space perturbations while preserving general capabilities. Crucially, this work establishes the first verifiable representation-level safety alignment, providing a new paradigm for deep, robust alignment grounded in internal model semantics.
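To make the LAPT idea concrete, here is a minimal sketch of one inner step of layer-wise adversarial patch training, under strong simplifying assumptions: a toy two-layer network stands in for an LLM, and a worst-of-k random search replaces whatever perturbation scheme the paper actually uses. All names (`forward`, `nll`, `patches`) are illustrative, not from the authors' code.

```python
# Sketch: one LAPT-style inner step on a toy network (illustrative assumption,
# not the paper's implementation).
import numpy as np

rng = np.random.default_rng(1)
IN, HIDDEN, OUT = 10, 16, 4
W1 = rng.standard_normal((HIDDEN, IN)) * 0.1
W2 = rng.standard_normal((OUT, HIDDEN)) * 0.1

def forward(x, patch=None):
    """Forward pass; `patch` is a perturbation injected into the hidden state."""
    h = np.tanh(W1 @ x)
    if patch is not None:
        h = h + patch
    return W2 @ h

def nll(logits, y):
    """Cross-entropy loss of class `y` under softmax(logits)."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[y])

x, y = rng.standard_normal(IN), 2

# Candidate patches of norm eps; the zero patch is included so the
# adversarial loss can never fall below the clean loss.
eps, k = 0.3, 32
patches = rng.standard_normal((k, HIDDEN))
patches *= eps / np.linalg.norm(patches, axis=1, keepdims=True)
patches = np.vstack([np.zeros((1, HIDDEN)), patches])

losses = [nll(forward(x, p), y) for p in patches]
worst = patches[int(np.argmax(losses))]

clean_loss = nll(forward(x), y)
adv_loss = nll(forward(x, worst), y)
# LAPT's outer step would then update W1, W2 to minimize adv_loss.
print(f"clean loss {clean_loss:.3f} -> adversarial loss {adv_loss:.3f}")
```

The key structural point is that the perturbation is applied to a hidden state rather than to the input tokens; the outer training loop then minimizes the loss under that worst-case patch.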
📝 Abstract
Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the negative log-likelihood of the model's original response. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Code and results are available at https://github.com/Carol-gutianle/LatentSafety.
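The NLL probe described above can be sketched in a few lines, assuming a toy linear "LM head" in place of a real aligned model; names like `toy_lm` and `probe_nll` are illustrative, not from the paper's released code.

```python
# Sketch: NLL-based latent sensitivity probe on a toy model (illustrative
# assumption, not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, SEQ = 50, 16, 8
unembed = rng.standard_normal((VOCAB, HIDDEN))  # stand-in for an LM head

def toy_lm(hidden):
    """Map per-position hidden states to next-token logits, shape (SEQ, VOCAB)."""
    return hidden @ unembed.T

def probe_nll(hidden, response_ids):
    """Mean NLL of a fixed response under (possibly perturbed) hidden states."""
    logits = toy_lm(hidden)
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-logp[np.arange(SEQ), response_ids].mean())

# The "original response" is whatever the clean hidden states decode to greedily.
hidden = rng.standard_normal((SEQ, HIDDEN))
response_ids = toy_lm(hidden).argmax(axis=-1)
nll_clean = probe_nll(hidden, response_ids)

# Shift the latent trajectory along a random direction of norm eps; a sharp
# NLL increase flags a locally sensitive direction, i.e. a candidate for ASA.
eps = 0.5
delta = rng.standard_normal(hidden.shape)
delta *= eps / np.linalg.norm(delta)
nll_shifted = probe_nll(hidden + delta, response_ids)
print(f"clean NLL {nll_clean:.3f} -> perturbed NLL {nll_shifted:.3f}")
```

In the paper's setting, directions that sharply raise the NLL of the original (refusal) response mark where a small latent shift could push the model off its aligned behavior.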