🤖 AI Summary
This work identifies a systematic compositional vulnerability in pretrained multimodal representations (e.g., CLIP): deceptive text paired with an image, video, or audio clip can mislead them into counterintuitive judgments. To evaluate this weakness, the work introduces the Multimodal Adversarial Compositionality (MAC) benchmark, which uses large language models (LLMs) to generate adversarial texts and scores them by sample-wise attack success rate and group-wise entropy-based diversity. To improve on zero-shot generation, it proposes a self-training framework that combines rejection-sampling fine-tuning with diversity-promoting filtering, applied to a comparatively small LLM (Llama-3.1-8B). The framework supports unified evaluation across image, video, and audio modalities. Experiments indicate that even small LLMs can effectively probe compositional vulnerabilities in multimodal representations, with gains in both attack success rate and the diversity of the generated texts. The benchmark thus offers a common basis for evaluating the compositional robustness of multimodal models.
📝 Abstract
While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities that lead to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples that exploit these vulnerabilities across different modalities, and that evaluates the samples through both sample-wise attack success rate and group-wise entropy-based diversity. To improve on zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including image, video, and audio.
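The two evaluation criteria named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: an attack is counted as successful when the deceptive text receives a higher similarity score than the original text, and diversity is approximated by the Shannon entropy of the whitespace-token distribution over a group of generated texts. The function names `attack_success_rate` and `token_entropy` are hypothetical, and MAC's actual metric definitions may differ.

```python
import math
from collections import Counter


def attack_success_rate(orig_scores, adv_scores):
    """Fraction of samples whose adversarial text outscores the original text.

    Assumption: each score is a similarity between a media item (image, video,
    or audio) and a text, as produced by some multimodal model such as CLIP.
    """
    if not orig_scores or len(orig_scores) != len(adv_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    wins = sum(adv > orig for orig, adv in zip(orig_scores, adv_scores))
    return wins / len(orig_scores)


def token_entropy(texts):
    """Shannon entropy (in bits) of the token distribution over a group of texts.

    Assumption: lowercased whitespace tokens stand in for whatever units the
    benchmark actually measures diversity over.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical similarity scores for three media-text pairs.
    original = [0.80, 0.50, 0.60]
    adversarial = [0.90, 0.40, 0.70]
    print("attack success rate:", attack_success_rate(original, adversarial))
    print("entropy:", token_entropy(["a dog chases a cat", "a cat chases a dog"]))
```

Under these assumptions, a generator that fools the model with the same rewrite every time would score high on the first metric and low on the second, which is why the abstract reports both.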