When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI

📅 2025-11-29

📈 Citations: 0

✨ Influential: 0

career value

210K/year

🤖 AI Summary

This study investigates perceptual deficiencies of Large Vision-Language Models (LVLMs) in detecting camouflaged multimodal harmful content—e.g., images embedding covert malicious text or semantically ambiguous memes. To this end, we introduce CamHarmTI, the first dedicated benchmark comprising 4,500+ expert-annotated samples, integrating hierarchical attention analysis and probing techniques to systematically diagnose LVLM bottlenecks in fine-grained text-image interaction understanding. Experiments reveal a stark performance gap: human accuracy reaches 95.75%, whereas state-of-the-art models like GPT-4o achieve only 2.10%; critical failures originate in early visual encoding layers. Fine-tuning Qwen2.5-VL-7B on CamHarmTI yields a 55.94% absolute accuracy gain, markedly enhancing robustness against concealed cross-modal threats. Our work establishes both a rigorous evaluation framework and actionable insights for improving LVLM safety in real-world multimodal adversarial scenarios.

Technology Category

Application Category

📝 Abstract

Large vision-language models (LVLMs) are increasingly used for tasks where detecting multimodal harmful content is crucial, such as online content moderation. However, real-world harmful content is often camouflaged, relying on nuanced text-image interplay, such as memes or images with embedded malicious text, to evade detection. This raises a key question: extbf{can LVLMs perceive such camouflaged harmful content as sensitively as humans do?} In this paper, we introduce CamHarmTI, a benchmark for evaluating LVLM ability to perceive and interpret camouflaged harmful content within text-image compositions. CamHarmTI consists of over 4,500 samples across three types of image-text posts. Experiments on 100 human users and 12 mainstream LVLMs reveal a clear perceptual gap: humans easily recognize such content (e.g., over 95.75% accuracy), whereas current LVLMs often fail (e.g., ChatGPT-4o achieves only 2.10% accuracy). Moreover, fine-tuning experiments demonstrate that ench serves as an effective resource for improving model perception, increasing accuracy by 55.94% for Qwen2.5VL-7B. Attention analysis and layer-wise probing further reveal that fine-tuning enhances sensitivity primarily in the early layers of the vision encoder, promoting a more integrated scene understanding. These findings highlight the inherent perceptual limitations in LVLMs and offer insight into more human-aligned visual reasoning systems.

Problem

Research questions and friction points this paper is trying to address.

Evaluates LVLMs' ability to detect camouflaged harmful multimodal content.

Reveals a significant perception gap between human and LVLM performance.

Proposes a benchmark to improve model sensitivity through fine-tuning.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing CamHarmTI benchmark for camouflaged harmful content

Fine-tuning improves model accuracy by enhancing early vision layers

Attention analysis reveals perceptual gaps between humans and LVLMs

🔎 Similar Papers

No similar papers found.