Probing the Robustness of Large Language Models Safety to Latent Perturbations

📅 2025-06-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work exposes a critical robustness gap in large language models (LLMs): existing alignment methods merely suppress surface-level refusal behaviors without restructuring internal representations, leaving models vulnerable to harmful outputs under small perturbations in latent space. To probe this fragility systematically, the authors measure the Negative Log-Likelihood (NLL) of a model's original response under latent shifts, and use this signal to construct the Activation Steering Attack (ASA), an adversarial probing method that steers hidden activations along vulnerable directions. Building on this diagnosis, they introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects robustness constraints directly into hidden states during training. The approach is validated with negative log-likelihood probing, latent-space sensitivity analysis, and adversarial hidden-state perturbation. Results demonstrate that LAPT significantly enhances model resilience against latent-space perturbations while preserving general capabilities, pointing toward representation-level alignment strategies grounded in internal model semantics rather than surface-level behavior supervision.

📝 Abstract
Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Code and results are available at https://github.com/Carol-gutianle/LatentSafety.
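The NLL probe described above can be illustrated with a minimal sketch: compute the negative log-likelihood of a fixed (original) response under the model's hidden states, then inject a small perturbation along a candidate latent direction and see how sharply that NLL changes. Everything here is a toy stand-in, not the paper's implementation: `W_out` is an assumed linear output head, and the hidden states and tokens are random placeholders for real LLM activations.

```python
import numpy as np

def nll_of_response(hidden, W_out, token_ids):
    """Negative log-likelihood of a fixed token sequence given per-step
    hidden states and a linear output head (toy stand-in for an LLM)."""
    logits = hidden @ W_out                       # (seq, vocab)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(token_ids)), token_ids].sum()

rng = np.random.default_rng(0)
seq, d, vocab = 4, 8, 16
hidden = rng.normal(size=(seq, d))        # placeholder hidden states
W_out = rng.normal(size=(d, vocab))       # assumed output projection
tokens = rng.integers(0, vocab, size=seq) # tokens of the original response

base = nll_of_response(hidden, W_out, tokens)

# Probe: perturb the hidden states along a unit direction with small
# magnitude eps, and measure how fast the original response's NLL rises.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
eps = 0.1
perturbed = nll_of_response(hidden + eps * direction, W_out, tokens)
sensitivity = (perturbed - base) / eps  # large values flag vulnerable directions
print(base, perturbed, sensitivity)
```

In this framing, directions with large `sensitivity` are candidates for steering the model away from its aligned response, which is the intuition behind using the probe to build attack trajectories.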
Problem

Research questions and friction points this paper is trying to address.

Assessing safety robustness of aligned language models to latent perturbations
Identifying vulnerabilities in alignment methods via hidden activation shifts
Developing representation-level training to enhance alignment robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probing latent space with Negative Log-Likelihood metric
Activation Steering Attack exploits vulnerable directions
Layer-wise Adversarial Patch Training enhances robustness
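The LAPT idea in the list above can be sketched as latent-space adversarial training: at a chosen layer, perturb the hidden states in the direction that most increases the loss on the safe response (an FGSM-style step, bounded by eps), then train on those perturbed states. This is a hedged toy version with an assumed linear head and analytic gradients, not the paper's actual training loop.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_loss_and_hidden_grad(hidden, W_out, targets):
    """Cross-entropy of target tokens under a linear head, plus its
    gradient with respect to the hidden states (analytic for this toy)."""
    probs = softmax(hidden @ W_out)                     # (seq, vocab)
    idx = np.arange(len(targets))
    loss = -np.log(probs[idx, targets]).mean()
    grad_logits = probs.copy()
    grad_logits[idx, targets] -= 1.0                    # d loss / d logits
    grad_hidden = grad_logits @ W_out.T / len(targets)  # chain rule through head
    return loss, grad_hidden

rng = np.random.default_rng(1)
seq, d, vocab = 4, 8, 16
hidden = rng.normal(size=(seq, d))         # placeholder layer activations
W_out = rng.normal(size=(d, vocab))        # assumed output projection
targets = rng.integers(0, vocab, size=seq) # tokens of the safe response

# Adversarial patch: ascend the loss in latent space (sign of the
# gradient, bounded by eps), yielding a worst-case local shift. A real
# training step would then minimize the loss on these patched states.
eps = 0.05
loss_clean, grad = ce_loss_and_hidden_grad(hidden, W_out, targets)
patched = hidden + eps * np.sign(grad)
loss_adv, _ = ce_loss_and_hidden_grad(patched, W_out, targets)
print(loss_clean, loss_adv)
```

Training against such patched states is what makes the safe response survive small latent shifts, which is exactly the failure mode the ASA probe exposes.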
🔎 Similar Papers
2023-11-15 · Conference on Empirical Methods in Natural Language Processing · Citations: 2
Tianle Gu
Tsinghua University
(M)LLM Safety, PEFT
Kexin Huang
Fudan University
Zongqi Wang
Tsinghua University
Reward Model, RL, Evaluation, AI Safety
Yixu Wang
Shanghai Artificial Intelligence Laboratory
Jie Li
Shanghai Artificial Intelligence Laboratory
Yuanqi Yao
INSAIT
Robotics, Manipulation
Yang Yao
The University of Hong Kong
Yujiu Yang
SIGS, Tsinghua University
Machine Learning, Natural language processing, Computer vision
Yan Teng
Shanghai Artificial Intelligence Laboratory
Yingchun Wang
Shanghai Artificial Intelligence Laboratory