Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

📅 2025-05-28

🤖 AI Summary

This work identifies a systematic compositional vulnerability in pre-trained multimodal representations (e.g., CLIP): they can be led to counterintuitive judgments by deceptive text paired with an image, video, or audio clip. To evaluate this weakness, the authors introduce the Multimodal Adversarial Compositionality (MAC) benchmark, which uses LLMs to generate deceptive text and scores it by sample-wise attack success rate and group-wise entropy-based diversity. To improve on zero-shot generation, they propose a self-training approach that combines rejection-sampling fine-tuning with diversity-promoting filtering. Using a relatively small language model (Llama-3.1-8B), the approach raises both attack success rate and sample diversity, and it reveals compositional vulnerabilities across image, video, and audio representations.

📝 Abstract

While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.
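
The two evaluation axes named in the abstract can be illustrated with a small sketch. This is a simplified reading, not the benchmark's actual implementation: it assumes an attack counts as successful when the deceptive text scores a higher cosine similarity to the media embedding than the original caption does, and it measures group diversity as Shannon entropy over whitespace-separated tokens. The function names `attack_success_rate` and `token_entropy` and the input layout are hypothetical, and MAC may apply additional success criteria.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def attack_success_rate(samples):
    """Fraction of samples where the deceptive text outscores the original caption.

    Assumed layout: each sample is (media_emb, original_text_emb, deceptive_text_emb).
    """
    wins = sum(1 for m, o, d in samples if cosine(m, d) > cosine(m, o))
    return wins / len(samples)


def token_entropy(texts):
    """Shannon entropy (bits) of the token distribution across a group of texts."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A higher success rate means more individual samples fool the model, while a higher entropy suggests the successful samples are not all variations of the same few edits.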

Problem

Research questions and friction points this paper is trying to address.

Benchmarking adversarial vulnerabilities in multimodal representations

Generating deceptive text samples via LLMs to exploit CLIP

Improving zero-shot attacks with self-training and diversity filtering

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs generate deceptive text that exploits CLIP's compositional vulnerabilities

Self-training with rejection-sampling fine-tuning and diversity-promoting filtering improves both attack success rate and sample diversity

Smaller models reveal multimodal compositional weaknesses effectively
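
The data-selection step behind the self-training idea above might look roughly like the following. This is a minimal sketch under stated assumptions, not the paper's procedure: the page does not describe how the diversity-promoting filter works, so a greedy token-overlap (Jaccard) filter is swapped in for illustration. `select_training_samples`, `is_successful`, and `max_overlap` are hypothetical names.

```python
def jaccard(a, b):
    """Token-level Jaccard overlap between two texts (1.0 if both are empty)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def select_training_samples(candidates, is_successful, max_overlap=0.5):
    """Keep successful candidates that are not too similar to those already kept.

    candidates: LLM-generated deceptive texts.
    is_successful: callable returning True if a text fools the target model
                   (assumed to be provided by the caller).
    max_overlap: assumed threshold; a candidate is dropped if its Jaccard
                 overlap with any kept text exceeds this value.
    """
    kept = []
    for text in candidates:
        if not is_successful(text):
            continue  # rejection sampling: discard failed attacks
        if all(jaccard(text, k) <= max_overlap for k in kept):
            kept.append(text)  # diversity filter: keep only sufficiently novel texts
    return kept
```

The kept texts would then serve as fine-tuning data for the next round of generation, so that the generator learns from attacks that both succeed and differ from one another.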
