Jailbreaking LLMs via Calibration

πŸ“… 2026-01-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge that safety alignment in large language models distorts their outputs away from the original data distribution, thereby hindering effective evaluation of jailbreak attacks. The authors model alignment-induced next-token prediction bias as a systematic distortion of the pre-aligned distribution and reframe the weak-to-strong jailbreaking problem as a prediction aggregation task. Within a loss-induced dual space, they derive an optimal gradient-shift-based aggregation strategy. They further show that logit arithmetic is a special case of this framework under cross-entropy loss and generalize it to arbitrary proper losses, proposing a novel hybrid aggregation algorithm. Experiments demonstrate significant improvements in attack success rates and reduced β€œjailbreak tax” across multiple red-teaming benchmarks and mathematical tasks, with particularly strong performance on the safety-hardened gpt-oss-120b model.

πŸ“ Abstract
Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model's aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack Success Rates and a lower "Jailbreak Tax" compared with existing methods, especially on the safety-hardened gpt-oss-120b.
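For context, the logit-arithmetic special case the abstract refers to can be sketched in a few lines. In weak-to-strong jailbreaking, the strong aligned model's logits are shifted by the scaled difference between a small unaligned model and its aligned counterpart. This is a minimal illustrative sketch, not the paper's implementation; the function name, the scaling parameter `alpha`, and the toy inputs are all assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def logit_arithmetic(strong_logits, weak_unaligned_logits, weak_aligned_logits, alpha=1.0):
    """Weak-to-strong logit arithmetic (hypothetical sketch): shift the strong
    model's next-token logits by the alpha-scaled gap between an unaligned and
    an aligned weak model, then renormalize with softmax."""
    combined = strong_logits + alpha * (weak_unaligned_logits - weak_aligned_logits)
    return softmax(combined)

# Toy example: three-token vocabulary, made-up logits.
p = logit_arithmetic(
    strong_logits=np.array([1.0, 2.0, 0.5]),
    weak_unaligned_logits=np.array([0.0, 3.0, 1.0]),
    weak_aligned_logits=np.array([0.5, 1.0, 1.0]),
    alpha=0.5,
)
```

With `alpha=0` the rule reduces to sampling from the strong model alone; the paper's framing generalizes this additive shift (the cross-entropy case) to gradient shifts induced by other proper losses.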
Problem

Research questions and friction points this paper is trying to address.

Jailbreaking
Safety Alignment
Large Language Models
Distribution Distortion
Red-Teaming
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jailbreaking
Safety Alignment
Forecast Aggregation
Gradient Shift
Logit Arithmetic
πŸ”Ž Similar Papers
No similar papers found.