🤖 AI Summary
This work systematically investigates, for the first time, the feasibility of integrating multimodal large language models (MLLMs) with augmented reality (AR) to enable sophisticated social engineering attacks. We propose SEAR, a three-stage closed-loop framework comprising AR context synthesis, role-aware multimodal retrieval-augmented generation (RAG), and an adaptive ReInteract agent, which together enable high-trust, multi-stage dynamic attacks. Our key contributions are: (1) establishing a novel AR–LLM co-driven attack paradigm; (2) designing a joint environmental–acoustic–visual modeling mechanism; and (3) introducing an interactive human–agent reasoning loop. In an IRB-approved study with 60 participants, 93.3% fell victim to phishing emails and 85% consented to answer malicious phone calls. To support reproducible evaluation, we publicly release the first annotated AR social dialogue dataset, comprising 180 interaction rounds; the study demonstrates attack efficacy while identifying critical realism bottlenecks.
📝 Abstract
Augmented Reality (AR) and Multimodal Large Language Models (MLLMs) are evolving rapidly, providing unprecedented capabilities for human-computer interaction. However, their integration introduces a new attack surface for social engineering. In this paper, we systematically investigate, for the first time, the feasibility of orchestrating AR-driven social engineering attacks using MLLMs via our proposed SEAR framework, which operates through three key phases: (1) AR-based social context synthesis, which fuses multimodal inputs (visual, auditory, and environmental cues); (2) role-based multimodal RAG (Retrieval-Augmented Generation), which dynamically retrieves and integrates contextual data while preserving character differentiation; and (3) ReInteract social engineering agents, which execute adaptive multiphase attack strategies through inference-interaction loops. To verify SEAR, we conducted an IRB-approved study with 60 participants across three experimental configurations (unassisted, AR+LLM, and full SEAR pipeline), compiling a new dataset of 180 annotated conversations in simulated social scenarios. Our results show that SEAR is highly effective at eliciting high-risk behaviors (e.g., 93.3% of participants were susceptible to email phishing). The framework was particularly effective at building trust, with 85% of targets willing to accept an attacker's call after an interaction. We also identified notable limitations: some interactions were rated "occasionally artificial" due to perceived authenticity gaps. This work provides a proof of concept for AR-LLM-driven social engineering attacks and offers insights for developing defensive countermeasures against next-generation augmented reality threats.
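The paper does not publish reference code, but the three-phase closed loop it describes can be pictured concretely. Below is a minimal, illustrative Python sketch of that loop under stated assumptions: every function, data structure, and heuristic here (synthesize_context, retrieve_role_facts, reinteract_step, and the keyword-overlap retrieval standing in for embedding-based RAG) is a hypothetical placeholder, not the authors' implementation.

```python
# Illustrative sketch of SEAR's three-phase closed loop (all names hypothetical;
# the paper does not release an implementation). Phase 1 fuses multimodal cues
# into a social-context profile, phase 2 retrieves role-consistent background
# facts (a toy stand-in for multimodal RAG), and phase 3 adapts the next
# conversational move based on the target's last response.

from dataclasses import dataclass, field

@dataclass
class SocialContext:
    visual: str            # e.g., badge text captured through the AR headset
    auditory: str          # e.g., an overheard conversation snippet
    environment: str       # e.g., "conference hall"
    profile: dict = field(default_factory=dict)

def synthesize_context(visual: str, auditory: str, environment: str) -> SocialContext:
    """Phase 1: fuse visual, auditory, and environmental cues into one profile."""
    ctx = SocialContext(visual, auditory, environment)
    ctx.profile = {"likely_role": "attendee", "topic_hint": auditory}
    return ctx

def retrieve_role_facts(ctx: SocialContext, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Phase 2: toy role-aware retrieval -- rank stored facts by keyword
    overlap with the synthesized context (a proxy for vector similarity)."""
    cues = f"{ctx.visual} {ctx.auditory} {ctx.environment}".lower().split()
    scored = sorted(knowledge_base,
                    key=lambda fact: -sum(w in fact.lower() for w in cues))
    return scored[:k]

def reinteract_step(ctx: SocialContext, facts: list[str], target_reply: str) -> str:
    """Phase 3: choose the next move from the target's last reply; in the
    real framework an MLLM would generate this turn adaptively."""
    if "suspicious" in target_reply.lower():
        return "de-escalate: pivot to small talk about the " + ctx.environment
    return "escalate: reference retrieved fact -> " + facts[0]

if __name__ == "__main__":
    kb = ["Keynote on AR security starts at 2pm in conference hall B",
          "Registration desk confirms attendee emails on request"]
    ctx = synthesize_context("badge: Alice, ACME Corp",
                             "talking about the AR security keynote",
                             "conference hall")
    facts = retrieve_role_facts(ctx, kb)
    print(reinteract_step(ctx, facts, "Oh interesting, tell me more"))
```

The closed-loop structure is the point of the sketch: each target reply feeds back into phase 3, which can re-query phase 2 for new role-consistent facts, mirroring the paper's inference-interaction loop.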