🤖 AI Summary
This work systematically investigates, for the first time, the feasibility of integrating multimodal large language models (MLLMs) with augmented reality (AR) to enable sophisticated social engineering attacks. We propose SEAR, a three-stage closed-loop framework comprising AR context synthesis, role-aware multimodal retrieval-augmented generation (RAG), and an adaptive ReInteract agent, which together enable high-trust, multi-stage dynamic attacks. Our key contributions are: (1) establishing a novel AR–LLM co-driven attack paradigm; (2) designing a joint environmental–acoustic–visual modeling mechanism; and (3) introducing an interactive human–agent reasoning loop. In an IRB-approved study with 60 participants, 93.3% fell victim to phishing emails and 85% consented to answer malicious phone calls. To support reproducible evaluation, we publicly release the first annotated AR social dialogue dataset, comprising 180 interaction rounds; the study demonstrates attack efficacy while identifying critical realism bottlenecks.
📝 Abstract
Augmented Reality (AR) and Multimodal Large Language Models (MLLMs) are evolving rapidly, providing unprecedented capabilities for human-computer interaction. However, their integration introduces a new attack surface for social engineering. In this paper, we systematically investigate, for the first time, the feasibility of orchestrating AR-driven social engineering attacks using MLLMs via our proposed SEAR framework, which operates through three key phases: (1) AR-based social context synthesis, which fuses multimodal inputs (visual, auditory, and environmental cues); (2) role-based multimodal RAG (Retrieval-Augmented Generation), which dynamically retrieves and integrates contextual data while preserving character differentiation; and (3) ReInteract social engineering agents, which execute adaptive multiphase attack strategies through inference-interaction loops. To verify SEAR, we conducted an IRB-approved study with 60 participants across three experimental configurations (unassisted, AR+LLM, and full SEAR pipeline), compiling a new dataset of 180 annotated conversations in simulated social scenarios. Our results show that SEAR is highly effective at eliciting high-risk behaviors (e.g., 93.3% of participants were susceptible to email phishing). The framework was particularly effective at building trust, with 85% of targets willing to accept an attacker's call after an interaction. We also identified notable limitations: some interactions were rated "occasionally artificial" due to perceived authenticity gaps. This work provides a proof of concept for AR-LLM-driven social engineering attacks and offers insights for developing defensive countermeasures against next-generation augmented reality threats.
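The paper does not publish reference code, but the three-phase closed loop it describes can be pictured concretely. Below is a minimal, illustrative Python sketch of that loop under stated assumptions: every function, data structure, and heuristic here (synthesize_context, retrieve_role_facts, reinteract_step, and the keyword-overlap retrieval standing in for embedding-based RAG) is a hypothetical placeholder, not the authors' implementation.

```python
# Illustrative sketch of SEAR's three-phase closed loop (all names hypothetical;
# the paper does not release an implementation). Phase 1 fuses multimodal cues
# into a social-context profile, phase 2 retrieves role-consistent background
# facts (a toy stand-in for multimodal RAG), and phase 3 adapts the next
# conversational move based on the target's last response.

from dataclasses import dataclass, field

@dataclass
class SocialContext:
    visual: str            # e.g., badge text captured through the AR headset
    auditory: str          # e.g., an overheard conversation snippet
    environment: str       # e.g., "conference hall"
    profile: dict = field(default_factory=dict)

def synthesize_context(visual: str, auditory: str, environment: str) -> SocialContext:
    """Phase 1: fuse visual, auditory, and environmental cues into one profile."""
    ctx = SocialContext(visual, auditory, environment)
    ctx.profile = {"likely_role": "attendee", "topic_hint": auditory}
    return ctx

def retrieve_role_facts(ctx: SocialContext, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Phase 2: toy role-aware retrieval -- rank stored facts by keyword
    overlap with the synthesized context (a proxy for vector similarity)."""
    cues = f"{ctx.visual} {ctx.auditory} {ctx.environment}".lower().split()
    scored = sorted(knowledge_base,
                    key=lambda fact: -sum(w in fact.lower() for w in cues))
    return scored[:k]

def reinteract_step(ctx: SocialContext, facts: list[str], target_reply: str) -> str:
    """Phase 3: choose the next move from the target's last reply; in the
    real framework an MLLM would generate this turn adaptively."""
    if "suspicious" in target_reply.lower():
        return "de-escalate: pivot to small talk about the " + ctx.environment
    return "escalate: reference retrieved fact -> " + facts[0]

if __name__ == "__main__":
    kb = ["Keynote on AR security starts at 2pm in conference hall B",
          "Registration desk confirms attendee emails on request"]
    ctx = synthesize_context("badge: Alice, ACME Corp",
                             "talking about the AR security keynote",
                             "conference hall")
    facts = retrieve_role_facts(ctx, kb)
    print(reinteract_step(ctx, facts, "Oh interesting, tell me more"))
```

The closed-loop structure is the point of the sketch: each target reply feeds back into phase 3, which can re-query phase 2 for new role-consistent facts, mirroring the paper's inference-interaction loop.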