🤖 AI Summary
Visual Information Manipulation (VIM) attacks in augmented reality (AR) covertly alter the semantic interpretation of real-world scenes via malicious virtual content, leading to user misperception and erroneous decision-making.
Method: This paper proposes VIM-Sense, a multimodal semantic reasoning framework designed to detect VIM attacks in AR. It introduces an AR-VIM taxonomy and benchmark dataset, categorizing attacks along three formats (character, phrase, and pattern manipulation) and three purposes (information replacement, information obfuscation, and extra wrong information). VIM-Sense combines the cross-modal semantic understanding of vision-language models (VLMs) with optical character recognition (OCR)-based fine-grained text analysis to enable end-to-end detection.
Contribution/Results: Evaluated on the AR-VIM dataset, VIM-Sense achieves 88.94% detection accuracy, consistently outperforming vision-only and text-only baselines. Its average attack detection latency is 7.07 seconds in a simulated video-processing framework and 7.17 seconds in a real-world mobile Android AR application, demonstrating both high accuracy and practical deployability.
📝 Abstract
The virtual content in augmented reality (AR) can introduce misleading or harmful information, leading to semantic misunderstandings or user errors. In this work, we focus on visual information manipulation (VIM) attacks in AR where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM. It consists of 452 raw-AR video pairs spanning 202 different scenes, each simulating a real-world AR scenario. To detect such attacks, we propose a multimodal semantic reasoning framework, VIM-Sense. It combines the language and visual understanding capabilities of vision-language models (VLMs) with optical character recognition (OCR)-based textual analysis. VIM-Sense achieves an attack detection accuracy of 88.94% on AR-VIM, consistently outperforming vision-only and text-only baselines. The system reaches an average attack detection latency of 7.07 seconds in a simulated video processing framework and 7.17 seconds in a real-world evaluation conducted on a mobile Android AR application.
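To make the taxonomy's three attack purposes concrete, here is a minimal illustrative sketch of how text extracted from a raw/AR frame pair could be labeled by diffing. This is an assumption-laden toy, not the paper's method: the actual VIM-Sense pipeline uses a VLM for semantic reasoning alongside OCR, and the upstream OCR step (which would supply `raw_text` and `ar_text`) is omitted here.

```python
import difflib

def classify_manipulation(raw_text: str, ar_text: str) -> list[str]:
    """Toy classifier: compare scene text before and after the AR overlay
    and label each word-level edit with one of the three attack purposes
    from the AR-VIM taxonomy. (Illustrative only; not the paper's code.)"""
    labels = []
    matcher = difflib.SequenceMatcher(a=raw_text.split(), b=ar_text.split())
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            labels.append("information replacement")   # real text swapped for fake text
        elif op == "delete":
            labels.append("information obfuscation")   # real text hidden by the overlay
        elif op == "insert":
            labels.append("extra wrong information")   # new misleading text added
    return labels

# Hypothetical examples of each purpose:
print(classify_manipulation("speed limit 30", "speed limit 80"))  # replacement
print(classify_manipulation("caution wet floor", "wet floor"))    # obfuscation
print(classify_manipulation("exit", "exit closed"))               # extra wrong info
```

A purely textual diff like this is exactly why the paper pairs OCR with a VLM: manipulations at the pattern level (e.g. altered arrows or symbols) carry no text change at all and require visual-semantic reasoning to catch.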