When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI

πŸ“… 2025-11-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study investigates perceptual deficiencies of Large Vision-Language Models (LVLMs) in detecting camouflaged multimodal harmful contentβ€”e.g., images embedding covert malicious text or semantically ambiguous memes. To this end, we introduce CamHarmTI, the first dedicated benchmark comprising 4,500+ expert-annotated samples, integrating hierarchical attention analysis and probing techniques to systematically diagnose LVLM bottlenecks in fine-grained text-image interaction understanding. Experiments reveal a stark performance gap: human accuracy reaches 95.75%, whereas state-of-the-art models like GPT-4o achieve only 2.10%; critical failures originate in early visual encoding layers. Fine-tuning Qwen2.5-VL-7B on CamHarmTI yields a 55.94% absolute accuracy gain, markedly enhancing robustness against concealed cross-modal threats. Our work establishes both a rigorous evaluation framework and actionable insights for improving LVLM safety in real-world multimodal adversarial scenarios.

πŸ“ Abstract
Large vision-language models (LVLMs) are increasingly used for tasks where detecting multimodal harmful content is crucial, such as online content moderation. However, real-world harmful content is often camouflaged, relying on nuanced text-image interplay, such as memes or images with embedded malicious text, to evade detection. This raises a key question: can LVLMs perceive such camouflaged harmful content as sensitively as humans do? In this paper, we introduce CamHarmTI, a benchmark for evaluating LVLMs' ability to perceive and interpret camouflaged harmful content within text-image compositions. CamHarmTI consists of over 4,500 samples across three types of image-text posts. Experiments on 100 human users and 12 mainstream LVLMs reveal a clear perceptual gap: humans easily recognize such content (95.75% accuracy), whereas current LVLMs often fail (e.g., GPT-4o achieves only 2.10% accuracy). Moreover, fine-tuning experiments demonstrate that CamHarmTI serves as an effective resource for improving model perception, increasing accuracy by 55.94% for Qwen2.5-VL-7B. Attention analysis and layer-wise probing further reveal that fine-tuning enhances sensitivity primarily in the early layers of the vision encoder, promoting a more integrated scene understanding. These findings highlight the inherent perceptual limitations in LVLMs and offer insight into more human-aligned visual reasoning systems.
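The layer-wise probing mentioned in the abstract is a standard diagnostic: fit a simple linear classifier on each layer's hidden states and compare held-out accuracies to see at which depth the harmful/benign signal becomes linearly decodable. The sketch below is a minimal, hypothetical illustration of that technique using synthetic features (it is not the authors' code; `layerwise_probe_accuracy` and its inputs are assumptions for illustration).

```python
import numpy as np

def layerwise_probe_accuracy(hidden_states, labels, train_frac=0.8, seed=0):
    """For each layer, fit a least-squares linear probe on that layer's
    hidden states and return its held-out classification accuracy.

    hidden_states: list of (n_samples, dim) arrays, one per layer
                   (e.g., extracted from a vision encoder).
    labels:        (n_samples,) array of 0/1 harmfulness labels.
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    idx = rng.permutation(n)            # random train/test split
    split = int(train_frac * n)
    tr, te = idx[:split], idx[split:]
    y = labels.astype(float) * 2 - 1    # map {0,1} -> {-1,+1} targets
    accs = []
    for H in hidden_states:
        X = np.hstack([H, np.ones((n, 1))])        # add bias column
        w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        pred = (X[te] @ w) > 0                     # sign of the score
        accs.append(float(np.mean(pred == (labels[te] == 1))))
    return accs
```

A layer whose probe accuracy stays near chance carries no linearly accessible signal; the paper's finding that fine-tuning changes *early* encoder layers corresponds to those layers' probe accuracies rising after fine-tuning.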
Problem

Research questions and friction points this paper is trying to address.

Evaluates LVLMs' ability to detect camouflaged harmful multimodal content.
Reveals a significant perception gap between human and LVLM performance.
Proposes a benchmark to improve model sensitivity through fine-tuning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the CamHarmTI benchmark for camouflaged harmful content
Fine-tuning improves model accuracy by enhancing early vision layers
Attention analysis reveals perceptual gaps between humans and LVLMs
πŸ”Ž Similar Papers
No similar papers found.
Yanhui Li
Zhejiang University, Hangzhou, China
Qi Zhou
Zhejiang University, Hangzhou, China
Zhihong Xu
Zhejiang University, Hangzhou, China
Huizhong Guo
Zhejiang University, Hangzhou, China
Deep Learning · Trustworthy Recommender Systems · Fairness Testing · AI Ethics
Wenhai Wang
Zhejiang University, Hangzhou, China
Dongxia Wang
Zhejiang University, Hangzhou, China
Intelligent Systems · Trustworthy AI · Security