🤖 AI Summary
Modeling the safety logic of large language models (LLMs) under black-box jailbreak attacks remains challenging due to limited access to internal mechanisms.
Method: This paper first empirically validates the predictability and distillability of LLM jailbreak behavior. It introduces an enhanced outline-filling attack for efficient adversarial prompt generation and proposes a ranking-based regression paradigm—replacing conventional scalar regression—to train a lightweight surrogate model that predicts the relative ordering of attack success rates (ASR). Built upon a compact neural architecture, the model enables dense sampling of safety boundaries and iterative attack optimization within a black-box setting.
Contribution/Results: Experiments show the model achieves 91.1% accuracy in predicting the relative ranking of average long response (ALR) and 69.2% accuracy in ASR prediction, substantially outperforming baseline methods. The approach establishes a scalable, low-overhead paradigm for LLM safety evaluation.
📝 Abstract
In the realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate (ASR) of adversarial prompts, remains underexplored. This work investigates the distillability of an LLM's core safety logic. We propose a novel framework that incorporates an improved outline-filling attack to achieve dense sampling of the model's safety boundaries. Furthermore, we introduce a ranking-based regression paradigm that replaces standard scalar regression and trains the proxy model to predict which of two prompts yields a higher ASR. Experimental results show that our proxy model achieves an accuracy of 91.1 percent in predicting the relative ranking of average long response (ALR), and 69.2 percent in predicting ASR. These findings confirm the predictability and distillability of jailbreak behaviors, and demonstrate the potential of leveraging such distillability to optimize black-box attacks.
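The core idea of the ranking-based regression paradigm can be illustrated with a minimal sketch: instead of regressing scalar ASR values, a surrogate scorer is trained on prompt pairs to predict which one has the higher ASR, using a pairwise logistic (RankNet-style) loss. Everything below is an assumption for illustration: the paper's surrogate is a compact neural network over real prompt features, whereas this sketch uses synthetic features and a linear scorer.

```python
import numpy as np

# Hypothetical sketch of ranking-based regression (not the paper's code).
# Synthetic "prompt features" X and a latent ASR stand in for real data.
rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.normal(size=(n, d))                    # stand-in prompt features
w_true = rng.normal(size=d)
asr = X @ w_true + 0.1 * rng.normal(size=n)    # latent attack success rate

# Build training pairs (i, j) such that asr[i] > asr[j].
idx = rng.integers(0, n, size=(2000, 2))
pairs = idx[asr[idx[:, 0]] > asr[idx[:, 1]]]

# Train a linear scorer s(x) = w.x with a pairwise logistic ranking loss:
#   L = log(1 + exp(-(s(x_i) - s(x_j))))  for each pair with asr_i > asr_j.
w = np.zeros(d)
lr = 0.1
for _ in range(200):
    diff = X[pairs[:, 0]] - X[pairs[:, 1]]     # x_i - x_j
    margin = diff @ w
    # gradient of log(1 + exp(-margin)) w.r.t. w, averaged over pairs
    g = -(diff * (1.0 / (1.0 + np.exp(margin)))[:, None]).mean(axis=0)
    w -= lr * g

# Evaluate pairwise ranking accuracy on fresh pairs.
test_idx = rng.integers(0, n, size=(1000, 2))
test_idx = test_idx[asr[test_idx[:, 0]] != asr[test_idx[:, 1]]]
pred = (X[test_idx[:, 0]] - X[test_idx[:, 1]]) @ w > 0
true = asr[test_idx[:, 0]] > asr[test_idx[:, 1]]
acc = (pred == true).mean()
print(f"pairwise ranking accuracy: {acc:.2f}")
```

The design choice mirrors the paper's motivation: predicting relative order is an easier, noise-tolerant objective than recovering exact ASR values, which is why the reported ranking accuracy (91.1% on ALR) exceeds scalar ASR prediction (69.2%).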