Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety mechanisms—such as prompt engineering, fine-tuning, and content moderation—are designed for isolated, single-model–single-user settings and fail to address emergent risks arising from recursive, multi-LLM agent chains. Locally aligned models may collectively fail due to interaction-induced emergent behaviors, highlighting the critical need to shift from “model-level” to “system-level” safety. Method: We propose the Emergent Systemic Risk Horizon (ESRH) framework, establishing a tri-level risk taxonomy spanning micro-level (individual agent behavior), meso-level (interaction patterns), and macro-level (ecosystem evolution). We further design InstitutionalAI—a self-regulating architecture enabling adaptive governance across multi-agent systems. Contribution/Results: Through formal modeling, multi-agent analysis, and dynamic governance design, we systematically characterize the mechanisms by which inter-LLM interactions generate collective risks. Our work provides the first foundational, embeddable framework for dynamic, system-aware governance in multi-agent LLM ecosystems.
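
The tri-level taxonomy described above maps naturally onto a small data model. Below is a minimal sketch in Python; the `Risk` dataclass, its field names, and the three example entries are illustrative assumptions, not definitions taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class RiskLevel(Enum):
    """The three levels of the ESRH taxonomy."""
    MICRO = "individual agent behavior"
    MESO = "interaction patterns"
    MACRO = "ecosystem evolution"


@dataclass(frozen=True)
class Risk:
    name: str
    level: RiskLevel
    description: str


# Hypothetical entries for illustration; the paper's own risk
# inventory may differ.
EXAMPLE_RISKS = [
    Risk("prompt-injection relay", RiskLevel.MICRO,
         "a single agent forwards an adversarial instruction"),
    Risk("error amplification", RiskLevel.MESO,
         "small per-hop distortions compound across an agent chain"),
    Risk("homogenization cascade", RiskLevel.MACRO,
         "ecosystem-wide convergence erodes behavioral diversity"),
]

for r in EXAMPLE_RISKS:
    print(f"[{r.level.name}] {r.name}: {r.description}")
```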

📝 Abstract
This paper examines why safety mechanisms designed for human-model interaction do not scale to environments where large language models (LLMs) interact with each other. Most current governance practices still rely on single-agent safety containment: prompts, fine-tuning, and moderation layers that constrain individual model behavior but leave the dynamics of multi-model interaction ungoverned. These mechanisms assume a dyadic setting: one model responding to one user under stable oversight. Yet research and industrial development are rapidly shifting toward LLM-to-LLM ecosystems, where outputs are recursively reused as inputs across chains of agents. In such systems, local compliance can aggregate into collective failure even when every model is individually aligned. We propose a conceptual transition from model-level safety to system-level safety, introducing the framework of the Emergent Systemic Risk Horizon (ESRH) to formalize how instability arises from interaction structure rather than from isolated misbehavior. The paper contributes (i) a theoretical account of collective risk in interacting LLMs, (ii) a taxonomy connecting micro-, meso-, and macro-level failure modes, and (iii) a design proposal for InstitutionalAI, an architecture for embedding adaptive oversight within multi-agent systems.
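
The abstract's central claim, that locally compliant agents can still produce collective failure, can be made concrete with a toy calculation. A minimal sketch follows, assuming each hop in an agent chain preserves a fixed fraction of semantic fidelity; the 0.98 retention rate and the 0.5 system-level floor are illustrative numbers, not figures from the paper.

```python
# Toy model: each hop in an LLM-to-LLM chain retains a fixed
# fraction of the original intent. Individually, 98% fidelity
# looks "aligned"; across a long chain it decays geometrically.
PER_HOP_FIDELITY = 0.98  # illustrative assumption
SAFETY_FLOOR = 0.50      # hypothetical system-level threshold

fidelity = 1.0
for hop in range(1, 51):
    fidelity *= PER_HOP_FIDELITY
    if fidelity < SAFETY_FLOOR:
        print(f"System-level failure at hop {hop}: "
              f"fidelity {fidelity:.3f} < floor {SAFETY_FLOOR}")
        break
else:
    print(f"Chain survived 50 hops with fidelity {fidelity:.3f}")
```

Under these assumptions the chain crosses the floor around hop 35, even though no single agent ever looks misaligned in isolation.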
Problem

Research questions and friction points this paper is trying to address.

Addresses safety risks in multi-agent LLM interactions
Proposes system-level safety beyond individual model alignment
Introduces taxonomy for collective failure modes in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transition from model-level to system-level safety
Introduce Emergent Systemic Risk Horizon framework
Propose InstitutionalAI architecture for adaptive oversight (one possible reading is sketched below)
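
The paper proposes InstitutionalAI at the design level and does not publish an implementation. One possible reading of "adaptive oversight embedded within a multi-agent system" is a governance layer that intercepts inter-agent messages and tightens its review threshold as anomalies accumulate; the class name, the escalation rule, and the simulated scores below are all hypothetical.

```python
import random


class InstitutionalMonitor:
    """Hypothetical governance layer: reviews inter-agent messages
    and adapts its strictness to the observed anomaly rate."""

    def __init__(self, base_threshold: float = 0.8):
        self.threshold = base_threshold
        self.anomalies = 0
        self.reviewed = 0

    def review(self, message: str, risk_score: float) -> bool:
        """Return True if the message may pass to the next agent."""
        self.reviewed += 1
        if risk_score > self.threshold:
            self.anomalies += 1
            # Adaptive rule (illustrative): tighten oversight after an
            # anomaly, but never below a hard lower bound.
            self.threshold = max(0.5, self.threshold - 0.05)
            return False
        # Relax slowly while traffic stays clean, capped at the baseline+.
        self.threshold = min(0.9, self.threshold + 0.005)
        return True


# Simulated traffic between two agents; in practice the risk scores
# would come from a real classifier, not a random draw.
monitor = InstitutionalMonitor()
random.seed(0)
for i in range(20):
    score = random.random()
    passed = monitor.review(f"msg-{i}", score)
    print(f"msg-{i}: score={score:.2f} "
          f"{'pass' if passed else 'blocked'} "
          f"(threshold now {monitor.threshold:.2f})")
```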
Piercosma Bisconti
Assistant Professor, Sapienza University of Rome & DEXAI – Artificial Ethics
Political Philosophy · AI Trustworthiness · Human-Robot Interactions · Philosophy of Technology
Marcello Galisai
DEXAI – Icaro Lab, Sapienza University of Rome
Federico Pierucci
DEXAI – Icaro Lab, Sant’Anna School of Advanced Studies
Marcantonio Bracale
DEXAI – Icaro Lab
Matteo Prandi
DEXAI – Icaro Lab, Sapienza University of Rome