Among Us: A Sandbox for Agentic Deception

📅 2025-04-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Prior research on AI agent deception lacks mechanisms for natural emergence and controllable sandbox environments. Method: Building on *AmongAgents*, an open-source text-based social-deduction game environment, the paper introduces *Among Us* as a rich sandbox in which LLM agents spontaneously exhibit human-like deceptive behavior, without explicit instruction prompting or pre-engineered backdoors. It proposes the unbounded Deception ELO metric to quantify deceptive capability, finding that frontier models win more because they are better at deception than at detecting it. It further evaluates safety detection techniques, including LLM output monitoring, linear probes, and sparse autoencoders, which generalize well out-of-distribution. Contribution/Results: the sandbox is a reproducible, scalable benchmark for studying agentically-motivated deception, supporting systematic modeling, detection, and alignment research; the sandbox and evaluation tools are publicly released.

📝 Abstract
Studying deception in AI agents is important and difficult due to the lack of model organisms and sandboxes that elicit the behavior without asking the model to act under specific conditions or inserting intentional backdoors. Extending upon *AmongAgents*, a text-based social-deduction game environment, we aim to fix this by introducing Among Us as a rich sandbox where LLM-agents exhibit human-style deception naturally while they think, speak, and act with other agents or humans. We introduce Deception ELO as an unbounded measure of deceptive capability, suggesting that frontier models win more because they're better at deception, not at detecting it. We evaluate the effectiveness of AI safety techniques (LLM-monitoring of outputs, linear probes on various datasets, and sparse autoencoders) for detecting lying and deception in Among Us, and find that they generalize very well out-of-distribution. We open-source our sandbox as a benchmark for future alignment research and hope that this is a good testbed to improve safety techniques to detect and remove agentically-motivated deception, and to anticipate deceptive abilities in LLMs.
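The abstract does not spell out the Deception ELO formula; a rating like this is typically built on the standard Elo update, sketched below (function names and the K-factor are illustrative, not from the paper):

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected win probability of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one game; score_a is 1.0 for an A win, 0.0 for a loss."""
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated agents; the winner gains k/2 = 16 points.
print(elo_update(1200.0, 1200.0, 1.0))  # (1216.0, 1184.0)
```

Because the update is zero-sum and never clips, the rating is unbounded, which is what lets it keep separating ever-stronger deceivers.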
Problem

Research questions and friction points this paper is trying to address.

Studying AI deception lacks suitable sandboxes and model organisms
Measuring the deceptive capability of AI agents effectively
Evaluating AI safety techniques for detecting deception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Among Us sandbox for natural agent deception
Deception ELO measures deceptive capability
Evaluation of AI safety techniques for deception detection
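One of the detection techniques evaluated above is a linear probe on model activations. A minimal numpy-only sketch of such a probe on synthetic activation vectors (all names and data are illustrative, not the paper's setup):

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe on activations (n, d) with 0/1 deception labels."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(deceptive)
        w -= lr * (acts.T @ (p - labels) / n)      # gradient descent on log-loss
        b -= lr * float(np.mean(p - labels))
    return w, b

def probe_predict(w, b, acts):
    """Label activations as deceptive (1) or honest (0) at a 0.5 threshold."""
    return (1.0 / (1.0 + np.exp(-(acts @ w + b))) > 0.5).astype(int)

# Synthetic, well-separated "honest" vs "deceptive" activations.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.3, (50, 4)), rng.normal(1.0, 0.3, (50, 4))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = train_linear_probe(X, y)
print((probe_predict(w, b, X) == y).mean())  # 1.0 on this separable toy data
```

On cleanly separated synthetic data the probe reaches perfect training accuracy; the paper's claim is the harder one, that probes trained on one distribution still detect deception out-of-distribution.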