The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work proposes and formalizes the "self-evolution trilemma," which asserts that in a fully closed multi-agent system, persistent self-evolution, complete isolation, and safe alignment with human values cannot be simultaneously satisfied. Within an information-theoretic framework, human value alignment is modeled as a distributional divergence, and it is theoretically demonstrated that closed-loop self-evolution inevitably induces statistical blind spots, leading to irreversible degradation of safety alignment. Empirical validation through multi-agent simulation platforms such as Moltbook confirms this phenomenon, revealing the intrinsic mechanism driving safety decay in isolated AI societies. This study advances AI safety research beyond patch-based defenses toward a fundamental understanding of dynamically evolving risks and underscores the necessity of external oversight or novel mechanisms to sustain long-term alignment.
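The page does not reproduce the paper's formal statement, but the divergence-based safety measure summarized above can be sketched as follows. The notation here ($P_H$ for the fixed anthropic value distribution, $Q_t$ for the agent society's behavior distribution at generation $t$) and the choice of KL divergence are illustrative assumptions, not necessarily the paper's exact formalization:

$$S_t = D_{\mathrm{KL}}\!\left(P_H \,\|\, Q_t\right) = \sum_{x} P_H(x)\,\log \frac{P_H(x)}{Q_t(x)}.$$

Under this reading, the trilemma says that if the update producing $Q_{t+1}$ sees only samples drawn from $Q_t$ (complete isolation) and the updates never stop (continuous self-evolution), then $S_t$ cannot remain bounded near zero, i.e. safety invariance fails.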

๐Ÿ“ Abstract
The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such a system would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment; we term this three-way requirement the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from anthropic value distributions. We then show theoretically that isolated self-evolution induces statistical blind spots, leading to irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena consistent with our theoretical prediction of inevitable safety erosion. We further propose several directions for alleviating the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
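A minimal toy simulation makes the "statistical blind spot" mechanism concrete. The sketch below is an assumption-laden illustration (categorical behavior distributions, multinomial sampling, maximum-likelihood refitting, and KL divergence as the safety measure are all choices made here for clarity), not the paper's actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) between two categorical distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

K = 50                                    # number of behavior categories
human_values = rng.dirichlet(np.ones(K))  # fixed "anthropic" reference distribution P
policy = human_values.copy()              # generation-0 society starts aligned (Q_0 = P)

n_samples = 200  # behaviors each generation emits (finite, hence sampling noise)
for generation in range(1, 201):
    # Closed loop: the next generation is fit ONLY to samples of the current policy.
    counts = rng.multinomial(n_samples, policy)
    policy = counts / counts.sum()        # maximum-likelihood refit, no external signal
    # Categories that draw zero samples become permanent blind spots:
    # once policy[i] == 0, category i can never be sampled again.
    if generation % 50 == 0:
        blind = int((policy == 0).sum())
        print(f"gen {generation:3d}  KL(P||Q) = {kl(human_values, policy):.3f}  "
              f"blind spots = {blind}/{K}")
```

Because the refit is purely maximum-likelihood on the loop's own samples, any behavior that draws zero samples in one generation is lost permanently, so the divergence from the fixed human-value distribution can only drift upward without any adversarial pressure. External oversight corresponds to periodically mixing fresh samples from the reference distribution back into the fit.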
Problem

Research questions and friction points this paper is trying to address.

self-evolving AI
safety alignment
multi-agent systems
anthropic values
closed-loop evolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving AI societies
safety alignment
information-theoretic framework
anthropic values
multi-agent systems
Chenxu Wang
University of Science and Technology of China
Chaozhuo Li
Microsoft Research Asia
Songyang Liu
Beijing University of Posts and Telecommunications, Beijing, China
Zejian Chen
Beijing University of Posts and Telecommunications, Beijing, China
Jinyu Hou
Beijing University of Posts and Telecommunications, Beijing, China
Ji Qi
Beijing University of Posts and Telecommunications, Beijing, China
Rui Li
Renmin University of China
LLM Agents, Graph Learning
Litian Zhang
Beihang University
Qiwei Ye
Beijing Academy of Artificial Intelligence
Scientific AI, AI for Science, Foundation Model
Zheng Liu
Beijing Academy of Artificial Intelligence, Beijing, China
Xu Chen
Renmin University of China, Beijing, China
Xi Zhang
Professor, Beijing University of Posts and Telecommunications
Data Mining, Computer Architecture, Trustworthy AI
Philip S. Yu
Professor of Computer Science, University of Illinois at Chicago
Data Mining, Database, Privacy