🤖 AI Summary
SmoothLLM provides a certified defense against jailbreak attacks, but its guarantee rests on the strong and practically unrealistic "k-unstable" assumption, which undermines the credibility of the resulting safety certificate.
Method: We propose a more realistic probabilistic safety definition, "(k, ε)-unstable," which requires only that model outputs remain stable with probability at least 1−ε under perturbations of up to k characters. Building on this definition, we develop a data-driven framework for estimating certified safety lower bounds, integrating empirical attack success rate modeling with SmoothLLM's smoothing mechanism. The framework supports robustness certification against mainstream gradient-based and semantics-based jailbreak attacks, including GCG and PAIR.
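To make the smoothing mechanism concrete, here is a minimal sketch of SmoothLLM-style randomized smoothing with a majority vote. The `model` and `judge` callables are hypothetical stand-ins (an LLM query function and a jailbreak classifier); the character-swap perturbation and parameter defaults are illustrative, not the paper's exact implementation.

```python
import random
import string

def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Randomly swap a fraction q of characters (a SmoothLLM-style swap perturbation)."""
    chars = list(prompt)
    n_swap = max(1, int(len(chars) * q))  # perturb at least one character
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

def smoothllm_defense(prompt, model, judge, n_copies=10, q=0.1, seed=0):
    """Query the model on N independently perturbed copies of the prompt and
    aggregate the jailbreak verdicts by majority vote.

    Returns True if the majority of perturbed copies are judged jailbroken
    (i.e., the defense fails), False otherwise.
    """
    rng = random.Random(seed)
    verdicts = [judge(model(perturb(prompt, q, rng))) for _ in range(n_copies)]
    return sum(verdicts) > n_copies / 2
```

The intuition behind the certificate: if an adversarial suffix is fragile to character-level noise (the instability assumption), most perturbed copies fail to jailbreak the model, so the majority vote returns a safe verdict with quantifiable probability.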
Contribution/Results: Our approach ensures theoretical rigor while maintaining deployment feasibility. Evaluated across multiple LLMs and attack scenarios, it significantly improves both certification coverage and robustness, delivering verifiable, practical security guarantees for large language models.
📝 Abstract
The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "(k, $\varepsilon$)-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, $\varepsilon$)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically grounded mechanism to make LLMs more resistant to the exploitation of their safety alignment, a critical challenge in secure AI deployment.
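One way to read the contrast between the two assumptions (an illustrative formalization consistent with the abstract; the symbols $\mathrm{JB}$ for a jailbreak indicator, $S$ for the adversarial suffix, and $d_H$ for Hamming distance are our notation, not necessarily the paper's):

```latex
% k-unstable (deterministic): every perturbation of the suffix S that changes
% at least k characters defeats the attack.
\mathrm{JB}(P \oplus S) = 1
\quad\text{and}\quad
\mathrm{JB}(P \oplus S') = 0 \;\; \forall\, S' : d_H(S, S') \ge k

% (k, \varepsilon)-unstable (probabilistic relaxation): such perturbations
% defeat the attack only with probability at least 1 - \varepsilon.
\Pr\!\left[\, \mathrm{JB}(P \oplus S') = 0 \;\middle|\; d_H(S, S') \ge k \,\right] \;\ge\; 1 - \varepsilon
```

The deterministic version demands that no sufficiently perturbed suffix ever succeeds, which is what rarely holds in practice; the relaxation tolerates an $\varepsilon$ fraction of surviving attacks and carries that slack through to the certified defense probability.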