SafeDream: Safety World Model for Proactive Early Jailbreak Detection

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the detection of multi-turn jailbreak attacks, which gradually bypass safety mechanisms over the course of a conversation and are hard to identify before harmful content is generated. The authors propose SafeDream, a lightweight external safety world model that requires no changes to the target model's weights, and they formulate a new metric, detection lead, which measures how early an attack is flagged before the LLM complies. SafeDream combines a safety state world model that tracks a compact safety representation across turns, CUSUM (cumulative sum) detection that accumulates weak per-turn risk signals, and contrastive imagination that rolls out attack and benign futures in latent space. Across three benchmarks it raises alarms on average 1.06 to 1.20 turns before compliance, earlier than all eight baselines, while keeping false positive rates competitive and outperforming the baselines in detection quality.

📝 Abstract

Multi-turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state-of-the-art models. Existing alignment-based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated. To address these limitations, we first formulate the proactive early jailbreak detection problem with a new metric, detection lead, that measures how early an attack can be detected before the LLM complies. We then propose SAFEDREAM, a lightweight world-model-based framework that operates as an external module without modifying the LLM's weights. SAFEDREAM introduces three components: (1) a safety state world model that encodes LLM hidden states into a compact safety representation and predicts how it evolves across turns, (2) CUSUM detection that accumulates weak per-turn risk signals into reliable evidence, and (3) contrastive imagination that simultaneously rolls out attack and benign futures in latent space to issue early alarms before jailbreaks occur. On three multi-turn jailbreak benchmarks (XGuard-Train, SafeDialBench, SafeMTData) against 8 baselines, SAFEDREAM achieves the best detection timeliness across all benchmarks (1.06-1.20 turns before compliance) while maintaining competitive false positive rates and outperforming baselines in detection quality.
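To make the CUSUM step and the detection lead metric concrete, here is a minimal sketch in Python. It is not the authors' implementation: the per-turn risk scores, the `drift` and `threshold` values, and the definition of lead as "compliance turn minus alarm turn" are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def cusum_alarm_turn(
    risk_scores: Sequence[float],
    drift: float = 0.3,       # assumed allowance subtracted each turn
    threshold: float = 1.0,   # assumed alarm threshold
) -> Optional[int]:
    """One-sided CUSUM over per-turn risk scores.

    Returns the 1-indexed turn at which the cumulative statistic first
    reaches the threshold, or None if no alarm is raised.
    """
    stat = 0.0
    for turn, score in enumerate(risk_scores, start=1):
        # Accumulate evidence above the drift; never go below zero.
        stat = max(0.0, stat + score - drift)
        if stat >= threshold:
            return turn
    return None


def detection_lead(alarm_turn: Optional[int], compliance_turn: int) -> Optional[int]:
    """Assumed definition: turns between the alarm and the LLM's compliance.

    Positive values mean the alarm came before the model complied.
    """
    if alarm_turn is None:
        return None
    return compliance_turn - alarm_turn


# Hypothetical escalating attack: no single score reaches the threshold,
# but the accumulated evidence does.
attack_scores = [0.3, 0.5, 0.7, 0.8, 0.9]
# Hypothetical benign dialogue with small, non-escalating scores.
benign_scores = [0.3, 0.2, 0.4, 0.1, 0.3]

attack_alarm = cusum_alarm_turn(attack_scores)
benign_alarm = cusum_alarm_turn(benign_scores)
lead = detection_lead(attack_alarm, compliance_turn=5)
```

In this toy run the attack dialogue triggers an alarm at turn 4, one turn before the assumed compliance at turn 5, while the benign dialogue never accumulates enough evidence to trigger one. The drift term is what keeps isolated mildly risky turns from adding up to a false alarm.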

Problem

Research questions and friction points this paper is trying to address.

jailbreak detection

safety alignment

multi-turn attacks

proactive detection

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

proactive early detection

safety world model

multi-turn jailbreak