🤖 AI Summary
This work addresses the security threat posed by contradictory virtual content attacks in augmented reality (AR), which can mislead users, induce semantic confusion, or propagate harmful information. To tackle this challenge, the authors introduce ContrAR, the first systematic evaluation framework for assessing model robustness against such adversarial content. They construct a novel dataset of human-verified real-world AR videos and conduct a multidimensional benchmark evaluation across eleven state-of-the-art vision-language models. Experimental results reveal that while current models exhibit basic comprehension capabilities, they remain notably deficient at detecting and reasoning about contradictory AR content, and they struggle to balance accuracy and latency effectively. This study establishes the first standardized benchmark for this threat, advancing the development of more robust AR systems.
📝 Abstract
Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, in which malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit a reasonable understanding of contradictory virtual content, substantial room remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.