AI Summary
This study investigates perceptual deficiencies of Large Vision-Language Models (LVLMs) in detecting camouflaged multimodal harmful content, e.g., images embedding covert malicious text or semantically ambiguous memes. To this end, we introduce CamHarmTI, the first dedicated benchmark comprising 4,500+ expert-annotated samples, integrating hierarchical attention analysis and probing techniques to systematically diagnose LVLM bottlenecks in fine-grained text-image interaction understanding. Experiments reveal a stark performance gap: human accuracy reaches 95.75%, whereas state-of-the-art models like GPT-4o achieve only 2.10%; critical failures originate in early visual encoding layers. Fine-tuning Qwen2.5-VL-7B on CamHarmTI yields a 55.94% absolute accuracy gain, markedly enhancing robustness against concealed cross-modal threats. Our work establishes both a rigorous evaluation framework and actionable insights for improving LVLM safety in real-world multimodal adversarial scenarios.
Abstract
Large vision-language models (LVLMs) are increasingly used for tasks where detecting multimodal harmful content is crucial, such as online content moderation. However, real-world harmful content is often camouflaged, relying on nuanced text-image interplay, such as memes or images with embedded malicious text, to evade detection. This raises a key question: can LVLMs perceive such camouflaged harmful content as sensitively as humans do? In this paper, we introduce CamHarmTI, a benchmark for evaluating the ability of LVLMs to perceive and interpret camouflaged harmful content within text-image compositions. CamHarmTI consists of over 4,500 samples across three types of image-text posts. Experiments with 100 human users and 12 mainstream LVLMs reveal a clear perceptual gap: humans easily recognize such content (e.g., over 95.75% accuracy), whereas current LVLMs often fail (e.g., ChatGPT-4o achieves only 2.10% accuracy). Moreover, fine-tuning experiments demonstrate that CamHarmTI serves as an effective resource for improving model perception, increasing accuracy by 55.94% for Qwen2.5-VL-7B. Attention analysis and layer-wise probing further reveal that fine-tuning enhances sensitivity primarily in the early layers of the vision encoder, promoting a more integrated scene understanding. These findings highlight the inherent perceptual limitations of LVLMs and offer insight into building more human-aligned visual reasoning systems.
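To make the idea of layer-wise probing concrete, the following is a minimal sketch, not the authors' procedure. It assumes that a small linear classifier (a "probe") is trained on the features of each layer separately and that held-out accuracy is compared across layers. The features here are synthetic stand-ins for real vision-encoder activations, and every name (`train_logistic_probe`, `layerwise_probe`, `make_synthetic_layers`, the signal strengths) is hypothetical.

```python
import math
import random


def train_logistic_probe(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of samples whose label the probe predicts correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


def layerwise_probe(features_by_layer, labels, train_fraction=0.7, seed=0):
    """Train one probe per layer on a shared split; return held-out accuracies."""
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)
    split = int(train_fraction * len(order))
    train_idx, test_idx = order[:split], order[split:]
    accuracies = []
    for feats in features_by_layer:
        w, b = train_logistic_probe([feats[i] for i in train_idx],
                                    [labels[i] for i in train_idx])
        accuracies.append(probe_accuracy(w, b,
                                         [feats[i] for i in test_idx],
                                         [labels[i] for i in test_idx]))
    return accuracies


def make_synthetic_layers(n_samples=400, dim=8,
                          signal_strengths=(0.0, 0.5, 3.0), seed=0):
    """Synthetic 'activations': Gaussian noise plus a label-dependent shift.

    A larger strength means the label is more linearly decodable at that layer.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    layers = []
    for strength in signal_strengths:
        shift = strength / math.sqrt(dim)  # shift of norm `strength` along (1,...,1)
        layers.append([[rng.gauss(0.0, 1.0) + (2 * y - 1) * shift
                        for _ in range(dim)] for y in labels])
    return layers, labels


if __name__ == "__main__":
    layers, labels = make_synthetic_layers()
    for i, acc in enumerate(layerwise_probe(layers, labels)):
        print(f"layer {i}: probe accuracy = {acc:.2f}")
```

In a real analysis, the synthetic arrays would be replaced by activations extracted from each layer of the model under study, and the resulting per-layer accuracy curve would be compared before and after fine-tuning.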