AI Deception: Risks, Dynamics, and Controls

📅 2025-11-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
AI deception—defined as AI agents inducing false beliefs to advance their own objectives—has evolved from a theoretical concern into an empirically observed security risk in large language models and autonomous agents. Method: Grounded in signaling theory, we formalize a definition of AI deception and propose a dual-loop “deception generation–mitigation” analytical framework. We introduce, for the first time, a three-level incentive model and three prerequisite capabilities, culminating in a deception cycle theory. Our methodology integrates signal-theoretic analysis, empirical experimentation, and static/interactive evaluation, underpinned by a benchmark and auditing protocol targeting supervision gaps, distributional shifts, and environmental stressors. Contribution/Results: We deliver a scalable detection benchmark, a hierarchical mitigation strategy, a multi-stakeholder governance framework, and an open platform (www.deceptionsurvey.com) to foster interdisciplinary research on AI deception.

Technology Category

Application Category

📝 Abstract
As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern to an empirically demonstrated risk across language models, AI agents, and emerging frontier systems. This project provides a comprehensive and up-to-date overview of the AI deception field, covering its core concepts, methodologies, genesis, and potential mitigations. First, we identify a formal definition of AI deception, grounded in signaling theory from studies of animal deception. We then review existing empirical studies and associated risks, highlighting deception as a sociotechnical safety challenge. We organize the landscape of AI deception research as a deception cycle, consisting of two key components: deception emergence and deception treatment. Deception emergence reveals the mechanisms underlying AI deception: systems with sufficient capability and incentive potential inevitably engage in deceptive behaviors when triggered by external conditions. Deception treatment, in turn, focuses on detecting and addressing such behaviors. On deception emergence, we analyze incentive foundations across three hierarchical levels and identify three essential capability preconditions required for deception. We further examine contextual triggers, including supervision gaps, distributional shifts, and environmental pressures. On deception treatment, we conclude detection methods covering benchmarks and evaluation protocols in static and interactive settings. Building on the three core factors of deception emergence, we outline potential mitigation strategies and propose auditing approaches that integrate technical, community, and governance efforts to address sociotechnical challenges and future AI risks. To support ongoing work in this area, we release a living resource at www.deceptionsurvey.com.
Problem

Research questions and friction points this paper is trying to address.

Defines AI deception using signaling theory from animal studies.
Analyzes emergence of AI deception through incentives and capabilities.
Proposes detection and mitigation strategies for AI deception risks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Define AI deception using signaling theory
Analyze deception emergence via incentives and capabilities
Propose mitigation strategies integrating technical and governance efforts
🔎 Similar Papers
No similar papers found.
B
Boyuan Chen
S
Sitong Fang
J
Jiaming Ji
Y
Yanxu Zhu
P
Pengcheng Wen
J
Jinzhou Wu
Y
Yingshui Tan
B
Boren Zheng
M
Mengying Yuan
W
Wenqi Chen
Donghai Hong
Donghai Hong
Peking University
AI SafetyAI AlignmentMulti-Modal Model
Alex Qiu
Alex Qiu
Stanford University
Robot LearningComputer Vision
X
Xin Chen
J
Jiayi Zhou
Kaile Wang
Kaile Wang
Peking University
J
Juntao Dai
Borong Zhang
Borong Zhang
University of Macau
Reinforcement learningRobotics
T
Tianzhuo Yang
S
Saad Siddiqui
I
Isabella Duan
Yawen Duan
Yawen Duan
University of Cambridge
Deep LearningArtificial IntelligenceAI Safety
B
Brian Tse
Jen-Tse Huang
Jen-Tse Huang
Johns Hopkins University
Artificial IntelligenceNatural Language ProcessingLarge Language Models
K
Kun Wang