Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

📅 2026-05-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limited generalization of existing safety steering methods, which rely on static supervised data and struggle against out-of-distribution jailbreak attacks. The authors propose a bi-level adversarial training framework that combines unsupervised latent direction discovery with adversarial training. In the inner step, the method simulates diverse jailbroken states by extrapolating in activation space from the refusal-state activations of harmful requests. In the outer step, it trains a potential-induced steering field that pushes those states toward refusal regions while leaving the activations of benign inputs unchanged. The approach targets zero-shot defense against unseen jailbreaks, with attack success rates mostly below 5% across three large language models and six classical jailbreak families. Subspace coverage rises throughout training, which helps explain the improved generalization.
πŸ“ Abstract
Jailbreak prompts can trigger harmful completions on aligned LLMs. In response, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited training set, whereas real jailbreaks evolve and are often out-of-distribution with respect to that training set, leading to failures on unseen attacks. In this paper, we tackle the problem of failure on unseen jailbreaks, building on unsupervised latent direction discovery. We propose a bi-level adversarial training framework for zero-shot jailbreak defense. In the inner step, we simulate diverse jailbroken activations by extrapolating from refusal-state harmful-request activations via unsupervised latent direction discovery, which expands the coverage of real jailbreak activation subspaces. In the outer step, we train a potential-induced steering field to push these adversarial jailbroken states into refusal regions while keeping benign activations unchanged. Across three LLMs and six classical jailbreak families, our method achieves strong defense with attack success rates mostly below 5%, and rising subspace coverage throughout training helps explain the improved generalization.
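The two steps described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation, and nearly everything in it is an assumption made for illustration: the activations are synthetic Gaussians rather than LLM hidden states, PCA stands in for the unsupervised latent direction discovery, the potential is a hypothetical form U(h) = sigmoid(w·h + b) · ½‖h − c‖² with c the mean refusal-state activation, and the gate is fit with plain logistic regression in a single pass instead of an alternating bi-level loop. All function names (`discover_directions`, `simulate_jailbroken`, `fit_gate`, `steer`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy stand-ins for one layer's activations (assumption: real ones come from an LLM).
benign = rng.normal(0.0, 0.3, size=(200, dim))
refusal = rng.normal(0.0, 0.3, size=(200, dim))
refusal[:, 1:3] *= 4.0   # two high-variance axes for PCA to find
refusal[:, 0] += 3.0     # offset separating harmful requests from benign ones
center = refusal.mean(axis=0)  # crude proxy for the "refusal region"


def discover_directions(acts, k):
    """Unsupervised direction discovery; PCA is a swapped-in placeholder."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # (k, dim), orthonormal rows


def simulate_jailbroken(acts, directions, alpha, rng):
    """Inner step: extrapolate each activation along a random signed direction."""
    idx = rng.integers(0, len(directions), size=len(acts))
    sign = rng.choice([-1.0, 1.0], size=(len(acts), 1))
    return acts + alpha * sign * directions[idx]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_gate(pos, neg, lr=0.5, steps=2000):
    """Outer step (simplified): logistic gate, high on simulated states, low on benign."""
    x = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        err = sigmoid(x @ w + b) - y  # gradient of cross-entropy w.r.t. the logits
        w -= lr * (x.T @ err) / len(x)
        b -= lr * err.mean()
    return w, b


def steer(h, w, b, center, eta=0.5):
    """One descent step on U(h) = sigmoid(w.h + b) * 0.5 * ||h - center||^2."""
    s = sigmoid(h @ w + b)[:, None]
    diff = h - center
    q = 0.5 * np.sum(diff**2, axis=1, keepdims=True)
    grad_u = s * (1.0 - s) * q * w + s * diff  # product rule: gate times quadratic
    return h - eta * grad_u


directions = discover_directions(refusal, k=2)
simulated = simulate_jailbroken(refusal, directions, alpha=2.0, rng=rng)
w, b = fit_gate(simulated, benign)

steered_sim = steer(simulated, w, b, center)
steered_benign = steer(benign, w, b, center)
dist_before = np.linalg.norm(simulated - center, axis=1).mean()
dist_after = np.linalg.norm(steered_sim - center, axis=1).mean()
jail_shift = np.linalg.norm(steered_sim - simulated, axis=1).mean()
benign_shift = np.linalg.norm(steered_benign - benign, axis=1).mean()
print(f"distance to refusal center: {dist_before:.2f} -> {dist_after:.2f}")
print(f"mean shift: simulated {jail_shift:.2f}, benign {benign_shift:.2f}")
```

Because the steering vector is the negative gradient of a scalar potential, the gate and the pull toward the refusal center are learned as one object: where the gate is near zero (benign inputs) the field nearly vanishes, and where it is near one the field points at the refusal center. A faithful reproduction would replace the synthetic data, the PCA step, and the one-shot gate fit with the paper's own components.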
Problem

Research questions and friction points this paper is trying to address.

jailbreak
out-of-distribution
generalization
adversarial robustness
unsupervised
Innovation

Methods, ideas, or system contributions that make the work stand out.

unsupervised latent direction discovery
adversarial training
activation steering
zero-shot jailbreak defense
refusal region
Luoyu Chen
University of Technology Sydney, Sydney, Australia
Weiqi Wang
Xi’an Jiaotong University, Xi’an, China
Zhiyi Tian
School of Cyber Science and Engineering, Southeast University, Nanjing, China
Chenhan Zhang
PhD
Deep learning, privacy-preserving
Feng Wu
National University of Singapore
Machine Learning, Medical Time Series
Jianhuan Huang
University of Technology Sydney, Sydney, Australia
Ahmed Asiri
University of Technology Sydney, Sydney, Australia
Shui Yu
University of Technology Sydney, Sydney, Australia