Jailbreak Distillation: Renewable Safety Benchmarking

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety benchmarks suffer from saturation, data contamination, and cross-model incomparability, undermining the fairness and reliability of evaluation. Method: This paper proposes "Jailbreak Distillation" (JBDistill), a framework that runs state-of-the-art jailbreak attack algorithms against a small set of development models to generate a high-quality candidate prompt pool, then applies automated prompt selection algorithms balancing effectiveness, diversity, and robustness to construct an updatable, generalizable, and fair evaluation set. Contribution/Results: The approach requires minimal human effort to rerun the pipeline and refresh the benchmark while ensuring reproducibility and fair comparison. Evaluated on 13 diverse target models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, the benchmark significantly outperforms existing safety benchmarks in effectiveness while maintaining high separability and prompt diversity, offering a sustainable, iteratively renewable approach to safety evaluation.
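
To make the construction stage concrete, below is a minimal sketch, assuming a simple interface in which each jailbreak attack algorithm takes a development model and a seed behavior and returns adversarial prompts; all function and variable names (build_candidate_pool, attacks, seed_behaviors) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the candidate-pool stage (not the paper's code):
# run each jailbreak attack algorithm against a small set of development
# models and keep every generated adversarial prompt as a candidate.

from typing import Callable, List

def build_candidate_pool(
    dev_models: List[str],
    attacks: List[Callable[[str, str], List[str]]],
    seed_behaviors: List[str],
) -> List[str]:
    """Collect jailbreak prompts produced by every (attack, dev model) pair."""
    pool: List[str] = []
    for attack in attacks:
        for model in dev_models:
            for behavior in seed_behaviors:
                # each attack returns one or more adversarial prompts
                pool.extend(attack(model, behavior))
    # deduplicate while preserving order
    return list(dict.fromkeys(pool))
```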

📝 Abstract
Large language models (LLMs) are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that "distills" jailbreak attacks into high-quality and easily-updatable safety benchmarks. JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks. JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility. It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns on saturation and contamination. Extensive experiments demonstrate our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity. Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.
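
The prompt selection step can be illustrated with a short, hedged sketch: one plausible greedy scheme that ranks candidates by attack success rate on the development models and skips prompts too similar to those already chosen, trading effectiveness against diversity. The names, the similarity measure, and the threshold are assumptions, not the paper's actual selection algorithms.

```python
# Illustrative greedy selection (assumed, not the paper's exact algorithm):
# score each candidate by its attack success rate on the development models,
# then greedily keep high-scoring prompts that are not too similar to
# prompts already selected.

from typing import Callable, Dict, List

def select_benchmark(
    pool: List[str],
    success_rate: Dict[str, float],           # prompt -> ASR on dev models
    similarity: Callable[[str, str], float],  # e.g. embedding cosine similarity
    budget: int,
    max_sim: float = 0.8,
) -> List[str]:
    """Pick up to `budget` effective yet mutually diverse prompts."""
    ranked = sorted(pool, key=lambda p: success_rate.get(p, 0.0), reverse=True)
    selected: List[str] = []
    for prompt in ranked:
        if len(selected) >= budget:
            break
        if all(similarity(prompt, s) < max_sim for s in selected):
            selected.append(prompt)
    return selected
```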
Problem

Research questions and friction points this paper is trying to address.

Existing LLM safety benchmarks saturate and become contaminated over time, limiting their long-term usefulness
Inconsistent evaluation prompts across models undermine fair comparison and reproducibility
Constructing and updating safety benchmarks currently requires substantial manual effort
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills jailbreak attacks run on development models into reusable safety benchmarks
Applies prompt selection algorithms to identify an effective, diverse subset of candidate prompts
Uses a consistent prompt set across models, enabling fair comparisons, reproducibility, and low-cost benchmark updates