Adaptive Probe-based Steering for Robust LLM Jailbreaking

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing contrastive steering methods for jailbreaking large language models depend on a limited set of inherently biased contrastive prompts and on manual tuning of the steering strength, which makes attack performance unstable. To address this, the work borrows the idea of model extraction to guide the learned steering vector toward the ideal one, and it adjusts the steering strength adaptively from the statistics of contrastive activations, with no additional contrastive prompts or manual tuning required. The approach improves both the robustness and the effectiveness of probe-based steering, raising the average harmfulness score on fortified large language models from 6% to 70% and exposing how these defenses break down.
📝 Abstract
Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of the steering strength, which limits their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors toward the ideal one, and we propose tuning the steering strength adaptively based on the statistics of contrastive activations. Experiments demonstrate that our method notably improves the effectiveness and robustness of probe-based steering without any extra contrastive prompts or laborious manual tuning. As an attack paper, this work focuses on revealing the breakdown of fortified LLMs, raising the average harmfulness score from 6% to 70%. Our code is available at https://github.com/fhdnskfbeuv/adaptiveSteering.
Problem

Research questions and friction points this paper is trying to address.

LLM jailbreaking
contrastive steering
probe-based steering
steering robustness
model extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive steering
probe-based jailbreaking
contrastive activations
model extraction
LLM robustness