The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work proposes and formalizes the "self-evolution trilemma," which asserts that in a fully closed multi-agent system, persistent self-evolution, complete isolation, and safe alignment with human values cannot be simultaneously satisfied. Within an information-theoretic framework, human value alignment is modeled as a distributional divergence, and it is theoretically demonstrated that closed-loop self-evolution inevitably induces statistical blind spots, leading to irreversible degradation of safety alignment. Empirical validation through multi-agent simulation platforms such as Moltbook confirms this phenomenon, revealing the intrinsic mechanism driving safety decay in isolated AI societies. This study advances AI safety research beyond patch-based defenses toward a fundamental understanding of dynamically evolving risks and underscores the necessity of external oversight or novel mechanisms to sustain long-term alignment.
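The page does not reproduce the paper's formal statement, but the divergence-based safety measure summarized above can be sketched as follows. The notation here ($P_H$ for the fixed anthropic value distribution, $Q_t$ for the agent society's behavior distribution at generation $t$) and the choice of KL divergence are illustrative assumptions, not necessarily the paper's exact formalization:

$$S_t = D_{\mathrm{KL}}\!\left(P_H \,\|\, Q_t\right) = \sum_{x} P_H(x)\,\log \frac{P_H(x)}{Q_t(x)}.$$

Under this reading, the trilemma says that if the update producing $Q_{t+1}$ sees only samples drawn from $Q_t$ (complete isolation) and the updates never stop (continuous self-evolution), then $S_t$ cannot remain bounded near zero, i.e. safety invariance fails.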

๐Ÿ“ Abstract
The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such a system would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment; we term this three-way requirement the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from anthropic value distributions. We then show theoretically that isolated self-evolution induces statistical blind spots, leading to irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena consistent with our theoretical prediction of inevitable safety erosion. We further propose several directions for alleviating the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
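A minimal toy simulation makes the "statistical blind spot" mechanism concrete. The sketch below is an assumption-laden illustration (categorical behavior distributions, multinomial sampling, maximum-likelihood refitting, and KL divergence as the safety measure are all choices made here for clarity), not the paper's actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) between two categorical distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

K = 50                                    # number of behavior categories
human_values = rng.dirichlet(np.ones(K))  # fixed "anthropic" reference distribution P
policy = human_values.copy()              # generation-0 society starts aligned (Q_0 = P)

n_samples = 200  # behaviors each generation emits (finite, hence sampling noise)
for generation in range(1, 201):
    # Closed loop: the next generation is fit ONLY to samples of the current policy.
    counts = rng.multinomial(n_samples, policy)
    policy = counts / counts.sum()        # maximum-likelihood refit, no external signal
    # Categories that draw zero samples become permanent blind spots:
    # once policy[i] == 0, category i can never be sampled again.
    if generation % 50 == 0:
        blind = int((policy == 0).sum())
        print(f"gen {generation:3d}  KL(P||Q) = {kl(human_values, policy):.3f}  "
              f"blind spots = {blind}/{K}")
```

Because the refit is purely maximum-likelihood on the loop's own samples, any behavior that draws zero samples in one generation is lost permanently, so the divergence from the fixed human-value distribution can only drift upward without any adversarial pressure. External oversight corresponds to periodically mixing fresh samples from the reference distribution back into the fit.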
Problem

Research questions and friction points this paper is trying to address.

self-evolving AI
safety alignment
multi-agent systems
anthropic values
closed-loop evolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving AI societies
safety alignment
information-theoretic framework
anthropic values
multi-agent systems
Chenxu Wang
University of Science and Technology of China
Chaozhuo Li
Microsoft Research Asia
Songyang Liu
Beijing University of Posts and Telecommunications, Beijing, China
Zejian Chen
Beijing University of Posts and Telecommunications, Beijing, China
Jinyu Hou
Beijing University of Posts and Telecommunications, Beijing, China
Ji Qi
Beijing University of Posts and Telecommunications, Beijing, China
Rui Li
Renmin University of China
LLM Agents, Graph Learning
Litian Zhang
Beihang University
Qiwei Ye
Beijing Academy of Artificial Intelligence
Scientific AI, AI for Science, Foundation Model
Zheng Liu
Beijing Academy of Artificial Intelligence, Beijing, China
Xu Chen
Renmin University of China, Beijing, China
Xi Zhang
Professor, Beijing University of Posts and Telecommunications
Data Mining, Computer Architecture, Trustworthy AI
Philip S. Yu
Professor of Computer Science, University of Illinois at Chicago
Data Mining, Database, Privacy