An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing jailbreak detection methods rely on single-generation assessments, which substantially underestimate the vulnerability of strongly aligned large language models. This work proposes a systematic evaluation framework based on multi-generation sampling that combines TF-IDF lexical analysis with generation inconsistency metrics. The approach is evaluated on the JailbreakBench dataset across models with varying alignment strengths. Experimental results show that moderate-scale sampling significantly improves detection performance, with diminishing returns at larger sampling budgets. Lexical signals are shown to conflate behavioral and topical cues rather than serving as pure indicators of harmful behavior. Furthermore, detection signals generalize partially across models, with stronger transfer within the same model family, offering a promising avenue for efficient red-teaming and auditing.
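As an illustration of what a generation inconsistency metric could look like, the sketch below scores a set of sampled responses to one prompt by their mean pairwise Jaccard distance over whitespace tokens. This is an assumption made for illustration only: the summary does not specify the paper's metric, and the function names `jaccard_distance` and `inconsistency_score` are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between the lowercase whitespace-token sets of two texts."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Two empty texts are treated as identical.
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def inconsistency_score(generations: list[str]) -> float:
    """Mean pairwise Jaccard distance across sampled generations.

    Hypothetical stand-in for an inconsistency metric: 0.0 means every
    sample uses the same tokens, 1.0 means no two samples share a token.
    """
    pairs = list(combinations(generations, 2))
    if not pairs:
        # Fewer than two samples: no disagreement can be measured.
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

Under this assumed metric, a prompt whose samples are all the same refusal scores 0.0, while a prompt that sometimes yields a refusal and sometimes a compliant answer scores higher. A score of this kind needs at least two generations per prompt, which is one reason it depends on the sampling budget.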

📝 Abstract

Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of output-based jailbreak detection under realistic conditions using the JailbreakBench Behaviors dataset and multiple generator models with varying alignment strengths. We evaluate both a lexical TF-IDF detector and a generation-inconsistency-based detector across different sampling budgets. Our results show that single-output evaluation systematically underestimates jailbreak vulnerability, as increasing the number of sampled generations reveals additional harmful behaviour. The most significant improvements occur when moving from a single generation to moderate sampling, while larger sampling budgets yield diminishing returns. Cross-generator experiments demonstrate that detection signals partially generalise across models, with stronger transfer observed within related model families. A category-level analysis further reveals that lexical detectors capture a mixture of behavioural signals and topic-specific cues, rather than purely harmful behaviour. Overall, our findings suggest that moderate multi-sample auditing provides a more reliable and practical approach for estimating model vulnerability and improving jailbreak detection in large language models. Code will be released.
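A simple probabilistic model helps explain why single-output evaluation underestimates vulnerability and why larger budgets show diminishing returns. The sketch below assumes, purely for illustration, that each sampled generation is harmful independently with a fixed per-sample rate `p`. Neither this independence assumption nor the example value of `p` comes from the paper, and the function names are hypothetical.

```python
def prob_at_least_one_harmful(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is harmful.

    Assumes each sample is harmful with the same probability p (illustrative).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


def marginal_gain(p: float, k: int) -> float:
    """Extra detection probability contributed by the k-th sample (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return prob_at_least_one_harmful(p, k) - prob_at_least_one_harmful(p, k - 1)


if __name__ == "__main__":
    p = 0.05  # hypothetical per-sample harmful rate, not a value from the paper
    for k in (1, 5, 10, 20, 50):
        print(f"k={k:>2}  P(at least one harmful)={prob_at_least_one_harmful(p, k):.3f}")
```

Under this model the k-th sample adds p(1 - p)^(k - 1) to the detection probability, so each additional generation contributes less than the previous one. That is consistent with the reported pattern of large gains from moderate sampling followed by diminishing returns, although the paper's empirical results need not follow this idealised curve.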

Problem

Research questions and friction points this paper is trying to address.

jailbreak detection

large language models

harmful outputs

model alignment

sampling-based evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-generation sampling

jailbreak detection

large language models