Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Contemporary multimodal large language models (MLLMs) face diverse jailbreak attacks, yet existing defenses suffer from poor generalizability and modality limitations. Method: We propose Test-time IMmunization (TIM), the first test-time self-evolving defense framework that decouples dynamic attack detection from instruction-rejection-driven safety fine-tuning, enabling unified protection against cross-modal (text and image) jailbreaks. TIM incorporates gist-token training, dynamic trigger detection, and modular parameter isolation to ensure both robustness and inference stability. Contribution/Results: Extensive evaluation across multiple LLMs and MLLMs demonstrates that TIM significantly improves jailbreak resistance while preserving original task performance, with negligible degradation (<0.5%). TIM is the first jailbreak defense framework to combine universality (across models and modalities), adaptivity (test-time evolution), and practical deployability.

📝 Abstract
While (multimodal) large language models (LLMs) have attracted widespread attention due to their exceptional capabilities, they remain vulnerable to jailbreak attacks. Various defense methods have been proposed to counter jailbreak attacks; however, they are often tailored to specific attack types, limiting their effectiveness against diverse adversarial strategies. For instance, rephrasing-based defenses are effective against adversarial text jailbreaks but fail to counteract image-based attacks. To overcome these limitations, we propose a universal defense framework, termed Test-time IMmunization (TIM), which can adaptively defend against various jailbreak attacks in a self-evolving way. Specifically, TIM first trains a gist token for efficient detection, which it then uses to detect jailbreak activities during inference. When jailbreak attempts are identified, TIM performs safety fine-tuning using the detected jailbreak instructions paired with refusal answers. Furthermore, to mitigate potential performance degradation in the detector caused by parameter updates during safety fine-tuning, we decouple the fine-tuning process from the detection module. Extensive experiments on both LLMs and multimodal LLMs demonstrate the efficacy of TIM.
Problem

Research questions and friction points this paper is trying to address.

Defending (multimodal) LLMs against diverse jailbreak attacks
Overcoming limitations of attack-specific defense methods
Preventing performance degradation during safety fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal defense framework against jailbreak attacks
Self-evolving gist token for adaptive detection
Decoupled safety fine-tuning to prevent degradation
Yongcan Yu
Master Student, CASIA
Trustworthy AI · Safety in AI

Yanbo Wang
NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Ran He
NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Jian Liang
Kuaishou Inc.
Transfer Learning · Graph Learning