Detection of adversarial intent in Human-AI teams using LLMs

📅 2026-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of human–large language model (LLM) collaborative systems to adversarial manipulation, where malicious actors can induce harmful decisions through prompt injection or data poisoning. Existing detection approaches rely on task-specific heuristics and exhibit limited generalizability. To overcome this limitation, the paper proposes, for the first time, leveraging an LLM as a task-agnostic defensive overseer that continuously monitors interaction logs during multi-turn human–AI dialogues. By capitalizing on the LLM’s contextual understanding and reasoning capabilities, the method identifies adversarial intent in real time without requiring task-specific rules. Evaluated on 25 rounds of authentic collaborative dialogues, the approach effectively detects malicious behaviors that evade conventional heuristic-based detectors, significantly enhancing team-level robustness against targeted attacks.

Technology Category

Application Category

📝 Abstract
Large language models (LLMs) are increasingly deployed in human-AI teams as support agents for complex tasks such as information retrieval, programming, and decision-making assistance. While these agents' autonomy and contextual knowledge enables them to be useful, it also exposes them to a broad range of attacks, including data poisoning, prompt injection, and even prompt engineering. Through these attack vectors, malicious actors can manipulate an LLM agent to provide harmful information, potentially manipulating human agents to make harmful decisions. While prior work has focused on LLMs as attack targets or adversarial actors, this paper studies their potential role as defensive supervisors within mixed human-AI teams. Using a dataset consisting of multi-party conversations and decisions for a real human-AI team over a 25 round horizon, we formulate the problem of malicious behavior detection from interaction traces. We find that LLMs are capable of identifying malicious behavior in real-time, and without task-specific information, indicating the potential for task-agnostic defense. Moreover, we find that the malicious behavior of interest is not easily identified using simple heuristics, further suggesting the introduction of LLM defenders could render human teams more robust to certain classes of attack.
Problem

Research questions and friction points this paper is trying to address.

adversarial intent
human-AI teams
LLM security
malicious behavior detection
prompt injection
Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial intent detection
large language models
human-AI teams
task-agnostic defense
malicious behavior identification
🔎 Similar Papers
No similar papers found.
A
Abed K. Musaffar
Department of Mechanical Engineering, University of California at Santa Barbara
A
Ambuj Singh
Department of Computer Science, University of California at Santa Barbara
Francesco Bullo
Francesco Bullo
Professor of Mechanical Engineering, UC Santa Barbara
Systems and ControlMulti-Agent SystemsRobotic NetworksPower SystemsSocial Networks