🤖 AI Summary
This work addresses a novel threat to multimodal large language models (MLLMs) in content moderation, termed "adversarial smuggling attacks," where adversaries exploit the gap between human and AI perception and reasoning to encode harmful content into visually interpretable forms that evade AI detection. The paper formally defines this attack paradigm, categorizing it into perceptual blind spots and reasoning disruptions, and introduces SmuggleBench, the first dedicated evaluation benchmark, comprising 1,700 instances. Experiments reveal alarming vulnerability: leading models such as GPT-5 and Qwen3-VL exhibit attack success rates exceeding 90%. By analyzing OCR robustness, vision-encoder limitations, and the scarcity of domain-specific adversarial examples, the study identifies the root causes of this vulnerability, and it further explores mitigation strategies, including test-time scaling via chain-of-thought prompting and adversarial training via supervised fine-tuning, to systematically uncover critical security gaps in MLLM-based content safety systems.
📝 Abstract
Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (which induce misclassification) and adversarial jailbreaks (which elicit harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into visual formats that humans can read but AI cannot, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, which disrupts text recognition; and (2) Reasoning Blockade, which inhibits semantic understanding even when text recognition succeeds. To evaluate this threat, we construct SmuggleBench, the first comprehensive benchmark, comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, with Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.
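To make the "human-readable but AI-unreadable" idea concrete, the text-domain sketch below (not from the paper, whose attacks operate on images) shows a simple analogue: swapping letters for visually similar Unicode homoglyphs so that a human still reads the original word while a naive exact-match moderation filter no longer fires. The `HOMOGLYPHS` map, `smuggle`, and `naive_filter` names are hypothetical illustrations.

```python
# Illustrative analogue of "perceptual blindness" (assumption: a toy
# keyword-based moderator; the paper's actual attacks are visual).

# Hypothetical homoglyph map: Latin letter -> look-alike Cyrillic letter.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}

def smuggle(text: str) -> str:
    """Re-encode text so it looks the same to humans but differs byte-wise."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def naive_filter(text: str, blocklist: list[str]) -> bool:
    """Toy moderator: flags text containing any exact blocklisted substring."""
    return any(bad in text for bad in blocklist)

blocklist = ["poison"]
original = "how to poison a well"
encoded = smuggle(original)

print(naive_filter(original, blocklist))  # True: the plain text is caught
print(naive_filter(encoded, blocklist))   # False: homoglyphs evade the match
```

The same principle, applied to rendered images rather than character codes, is what makes the visual attacks studied here hard for MLLM moderators to detect.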