Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

πŸ“… 2026-01-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models remain vulnerable to jailbreak attacks that circumvent existing alignment mechanisms and induce harmful outputs. This work proposes a decoding-phase, safety-aware probing method and shows, for the first time, that detectable internal safety signals persist within the model even during successful jailbreak attempts. By dynamically probing and intervening in real time, the approach activates the model's intrinsic safety awareness to enable early-stage defense. Notably, the method requires no architectural modifications and does not compromise generation fluency, achieving significantly enhanced robustness against diverse jailbreak attacks while maintaining a low false rejection rate and high response quality. These results demonstrate the effectiveness and practicality of leveraging latent safety signals for real-time mitigation without degrading model performance.

πŸ“ Abstract
Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding-based constraints and post-hoc content detectors, struggle against sophisticated jailbreaks, often failing to achieve robust detection or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety-related signals during generation. However, these signals are overridden by the model's drive for fluent continuation, preventing timely self-correction or refusal. Building on this observation, we propose a simple yet effective approach that explicitly surfaces and leverages these latent safety signals for early detection of unsafe content during decoding. Experiments across diverse jailbreak attacks demonstrate that our approach significantly enhances safety, while maintaining low over-refusal rates on benign inputs and preserving response quality. Our results suggest that activating intrinsic safety-awareness during decoding offers a promising and complementary direction for defending against jailbreak attacks. Code is available at: https://github.com/zyz13590/SafeProbing.
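The abstract describes surfacing latent safety signals at each decoding step and intervening before unsafe content is completed. The snippet below is a minimal sketch of that general idea, not the authors' released implementation (see the linked repository); the linear probe, its checkpoint name, the decision threshold, the refusal text, and the placeholder model are all assumptions made for illustration.

```python
# Minimal sketch of in-decoding safety probing (illustrative only, not the
# authors' released implementation). Assumptions: the probe is a linear head
# over the last hidden state, trained offline on labeled safe/unsafe
# generations; the threshold and refusal text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # placeholder; in practice an aligned chat model
THRESHOLD = 0.8            # assumed decision threshold for P(unsafe)
REFUSAL = "I'm sorry, but I can't help with that."

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device).eval()

# Hypothetical safety probe; a real setup would load trained weights, e.g.
# probe.load_state_dict(torch.load("safety_probe.pt"))
probe = torch.nn.Linear(model.config.hidden_size, 1).to(device)

@torch.no_grad()
def generate_with_safety_probe(prompt: str, max_new_tokens: int = 128) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    generated = input_ids
    for _ in range(max_new_tokens):
        # Full-sequence forward pass each step (no KV cache) to keep the sketch short.
        out = model(generated, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1, :]            # newest position's hidden state
        unsafe_prob = torch.sigmoid(probe(last_hidden)).item()   # surface the latent safety signal
        if unsafe_prob > THRESHOLD:
            return REFUSAL                                        # early, in-decoding intervention
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding for simplicity
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(generate_with_safety_probe("Explain how photosynthesis works."))
```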
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
large language models
safety alignment
decoding process
defense mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

jailbreak defense
in-decoding safety probing
latent safety signals
large language models
safety-aware generation
πŸ‘₯ Authors
Yinzhi Zhao, Northeastern University, China
Ming Wang, Ph.D. student, Data Mining Group, Northeastern University, Shenyang (Machine Psychology, AI for Mental Health, LLM-based Agents)
Shi Feng, Northeastern University, China
Xiaocui Yang, Lecturer, Northeastern University, China (Multimodal Sentiment Analysis, Data Mining, Multimodal Large Language Models)
Daling Wang, Northeastern University, China
Yifei Zhang, Northeastern University, China