Mitigating Deceptive Alignment via Self-Monitoring

📅 2025-05-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing safety mechanisms only filter deceptive outputs post-hoc and cannot intercept a model’s latent misaligned objectives during chain-of-thought (CoT) reasoning. Method: We propose CoT Monitor+, a framework that synchronously generates self-assessment monitoring signals during CoT inference; we construct DeceptionBench, a five-dimensional deception evaluation benchmark; and we combine self-supervised monitor training, auxiliary-reward reinforcement learning, an internal feedback loop, and substitution of the self-monitor for an external weak discriminator. Contribution/Results: This work introduces the first endogenous self-monitoring mechanism for CoT reasoning, suppressing deceptive alignment at its inferential origin. It achieves an average 43.8% reduction in deceptive behavior without compromising task accuracy, and it significantly improves reasoning transparency and the detectability of latent, hidden reasoning steps.

📝 Abstract
Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment: situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer it with CoT Monitor+, the first framework that embeds a Self-Monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes covert alignment-faking, sycophancy, and other deceptive behaviors. We evaluate various LLMs and show that unrestricted CoT generally aggravates deceptive tendencies. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at cot-monitor-plus.github.io.
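The auxiliary-reward idea in the abstract can be sketched in a few lines. This is not the authors' implementation: the function name, the scalar honesty score in [0, 1], and the weighting parameter `lam` are all illustrative assumptions about how a self-monitor signal could be blended into an RL reward.

```python
# Minimal sketch (illustrative, not the paper's code) of blending a
# task reward with a hypothetical self-monitor honesty score.

def combined_reward(task_reward: float,
                    monitor_score: float,
                    lam: float = 0.5) -> float:
    """Shape the RL reward with the self-monitor's signal.

    monitor_score near 1.0 means the monitor judged the reasoning
    trace honest; near 0.0 means it flagged a deceptive strategy,
    which is penalized even if the task was solved.
    """
    if not 0.0 <= monitor_score <= 1.0:
        raise ValueError("monitor_score must lie in [0, 1]")
    # Auxiliary term: positive for honest traces, negative for flagged ones.
    return task_reward + lam * (2.0 * monitor_score - 1.0)

# Two traces with identical task success: the flagged (deceptive) one
# receives a strictly lower shaped reward than the honest one.
honest = combined_reward(1.0, monitor_score=0.9)
deceptive = combined_reward(1.0, monitor_score=0.1)
assert honest > deceptive
```

The design choice this illustrates is the feedback loop described above: because the penalty applies during training rather than as a post-hoc output filter, the policy is steered away from misaligned strategies at the point where they arise in the reasoning process.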
Problem

Research questions and friction points this paper is trying to address.

Intercepting deceptive alignment during model reasoning
Reducing covert misaligned goals in chain-of-thought processes
Improving transparency by suppressing obfuscated internal strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embeds Self-Monitor in CoT reasoning
Uses auxiliary reward for honest reasoning
Introduces DeceptionBench for evaluation
Jiaming Ji
PKU Alignment Team, Peking University
Wenqi Chen
PKU Alignment Team, Peking University
Kaile Wang
Peking University
Donghai Hong
Peking University
Sitong Fang
PKU Alignment Team, Peking University
Boyuan Chen
PKU Alignment Team, Peking University
Jiayi Zhou
PKU Alignment Team, Peking University
Juntao Dai
PKU Alignment Team, Peking University
Sirui Han
The Hong Kong University of Science and Technology
Yike Guo
Hong Kong University of Science and Technology
Yaodong Yang
PKU Alignment Team, Peking University