🤖 AI Summary
Large language models (LLMs) are vulnerable to semantics-preserving adversarial prompt attacks, and existing robustness certification methods suffer from loose certified bounds, owing to the absence of semantic-consistency verification, and from high computational overhead caused by repeated sampling. To address these limitations, we propose a clustering-guided denoising smoothing framework. Our method introduces a novel semantic clustering filter that explicitly enforces meaning consistency among perturbed outputs, derives tighter probabilistic robustness bounds via theoretical analysis, and incorporates a refinement module with an efficient synonym substitution strategy to substantially reduce sampling requirements. Evaluated on multiple benchmark tasks and jailbreak defense scenarios, our approach achieves significantly tighter certified bounds and a 3.2× inference speedup, outperforming state-of-the-art certification methods in both robustness and efficiency.
📝 Abstract
Recent advancements in Large Language Models (LLMs) have led to their widespread adoption in daily applications. Despite their impressive capabilities, they remain vulnerable to adversarial attacks, as even minor meaning-preserving changes such as synonym substitutions can lead to incorrect predictions. As a result, certifying the robustness of LLMs against such adversarial prompts is of vital importance. Existing approaches rely on word deletion or simple denoising strategies to achieve robustness certification. However, these methods face two critical limitations: (1) they yield loose robustness bounds due to the lack of semantic validation for perturbed outputs, and (2) they suffer from high computational costs due to repeated sampling. To address these limitations, we propose CluCERT, a novel framework for certifying LLM robustness via clustering-guided denoising smoothing. Specifically, to achieve tighter certified bounds, we introduce a semantic clustering filter that reduces noisy samples and retains meaningful perturbations, supported by theoretical analysis. Furthermore, we enhance computational efficiency through two mechanisms: a refine module that extracts core semantics, and a fast synonym substitution strategy that accelerates the denoising process. Finally, we conduct extensive experiments on various downstream tasks and jailbreak defense scenarios. Experimental results demonstrate that our method outperforms existing certification approaches in both robustness bounds and computational efficiency.
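The smoothing pipeline the abstract describes (perturb the prompt via synonym substitution, filter perturbations for semantic consistency, then aggregate predictions) can be sketched as a minimal toy. This is an illustrative sketch only, not the paper's implementation: the hand-made `SYNONYMS` table, the bag-of-words `embed`, and the keyword `classify` are stand-ins for the paper's fast synonym substitution strategy, a real sentence encoder, and the target LLM, respectively.

```python
import random
from collections import Counter

# Assumption: a tiny synonym table standing in for the paper's
# fast synonym substitution strategy.
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"]}

def perturb(prompt, rng):
    """Randomly swap words for synonyms (semantics-preserving noise)."""
    return " ".join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < 0.5 else w
        for w in prompt.split()
    )

def embed(text):
    """Bag-of-words vector; a real system would use a sentence encoder."""
    return Counter(text.split())

def cosine(a, b):
    num = sum(a[k] * b.get(k, 0) for k in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return num / (na * nb) if na and nb else 0.0

def classify(prompt):
    """Toy 'LLM': positive iff the prompt contains a positive keyword."""
    pos = {"good", "great", "fine"}
    return "positive" if any(w in pos for w in prompt.split()) else "negative"

def smoothed_predict(prompt, n=50, sim_threshold=0.5, seed=0):
    """Clustering-guided smoothing: sample perturbations, keep only those
    whose embedding stays close to the original (semantic-consistency
    filter), then take a majority vote over the surviving samples."""
    rng = random.Random(seed)
    samples = [perturb(prompt, rng) for _ in range(n)]
    anchor = embed(prompt)
    kept = [s for s in samples if cosine(embed(s), anchor) >= sim_threshold]
    votes = Counter(classify(s) for s in kept)
    label, count = votes.most_common(1)[0]
    return label, count / len(kept)
```

In the actual certification setting, the vote fraction returned here would feed a probabilistic bound (e.g. a confidence interval over the majority-class probability) rather than being reported directly; the filter's role is to discard samples whose meaning drifted, which is what tightens that bound.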