CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models

📅 2025-03-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the insufficient robustness of large vision-language models (VLMs) against visual jailbreaking attacks. Methodologically, it introduces the first *toxicity-aware* defense framework with formal guarantees. It proposes a novel latent-space semantic consistency-based toxicity-aware distance metric—overcoming limitations of conventional cosine similarity—and designs a regression-style randomized smoothing certification scheme, establishing the first formal black-box robustness guarantees for VLMs under both pixel-level and structural perturbations. Certification is efficiently realized via Gaussian or Laplacian noise injection. Experiments demonstrate that the framework achieves an average defense success rate of 92.4% across diverse visual jailbreaking attacks, with certified radii improved by 3.1× over state-of-the-art heuristic methods—marking a significant advance in provably robust VLM security.

Technology Category

Application Category

📝 Abstract
Recent advances in large vision-language models (VLMs) have demonstrated remarkable success across a wide range of visual understanding tasks. However, the robustness of these models against jailbreak attacks remains an open challenge. In this work, we propose a universal certified defence framework to safeguard VLMs rigorously against potential visual jailbreak attacks. First, we proposed a novel distance metric to quantify semantic discrepancies between malicious and intended responses, capturing subtle differences often overlooked by conventional cosine similarity-based measures. Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees against both adversarial and structural perturbations, even under black-box settings. Complementing this, our feature-space defence introduces noise distributions (e.g., Gaussian, Laplacian) into the latent embeddings to safeguard against both pixel-level and structure-level perturbations. Our results highlight the potential of a formally grounded, integrated strategy toward building more resilient and trustworthy VLMs.
Problem

Research questions and friction points this paper is trying to address.

Develop certified defense against visual jailbreak attacks in VLMs.
Introduce novel distance metric for semantic discrepancy quantification.
Provide robustness guarantees via randomized smoothing and noise distributions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel distance metric for semantic discrepancy quantification
Regressed certification with randomized smoothing for robustness
Feature-space defence using noise distributions in embeddings
🔎 Similar Papers
No similar papers found.