🤖 AI Summary
In multi-agent AI systems, emergent risks arise from agent interactions that have no counterpart in single-agent settings, rendering conventional single-agent safety methods inadequate. This paper addresses deployment scenarios where an organisation controls its agent configurations, and proposes the first structured risk-analysis framework designed specifically for LLM-based multi-agent systems. First, it systematically defines six interaction-induced failure modes unique to such systems. Second, it introduces a progressive verification methodology centred on analytical validity, incrementally increasing confidence across deployment stages. Third, it combines simulation-based testing, observational analysis, benchmark evaluation, and red-teaming across multiple abstraction levels and deployment phases to gather convergent evidence of safety. The framework embeds into existing governance pipelines, providing a scalable, verifiable foundation for risk identification and assessment in multi-agent AI systems.
📝 Abstract
Organisations are starting to adopt LLM-based AI agents, with their deployments naturally evolving from single agents towards interconnected, multi-agent networks. Yet a collection of safe agents does not guarantee a safe collection of agents, as interactions between agents over time create emergent behaviours and induce novel failure modes. This means multi-agent systems require a fundamentally different risk analysis approach than that used for a single agent.
This report addresses the early stages of risk identification and analysis for multi-agent AI systems operating within governed environments, where organisations control their agent configurations and deployment. In this setting, we examine six critical failure modes: cascading reliability failures, inter-agent communication failures, monoculture collapse, conformity bias, deficient theory of mind, and mixed-motive dynamics. For each, we provide a toolkit that practitioners can extend or integrate into their existing frameworks to assess that failure mode within their organisational context.
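As an illustration of how such a toolkit might be integrated into an existing risk framework, the sketch below encodes the six failure modes named in the abstract as a simple assessment checklist. The `RiskAssessment` class and its methods are hypothetical scaffolding invented for this example, not an interface defined by the report.

```python
from dataclasses import dataclass, field

# The six failure modes examined in the report. Practitioners could extend
# this list with organisation-specific failure modes.
FAILURE_MODES = [
    "cascading reliability failures",
    "inter-agent communication failures",
    "monoculture collapse",
    "conformity bias",
    "deficient theory of mind",
    "mixed-motive dynamics",
]

@dataclass
class RiskAssessment:
    """Hypothetical tracker: which failure modes have been assessed, and how severe."""
    findings: dict = field(default_factory=dict)

    def record(self, mode: str, severity: str) -> None:
        # Reject modes outside the known list so typos surface early.
        if mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {mode}")
        self.findings[mode] = severity

    def unassessed(self) -> list:
        # Failure modes the organisation has not yet evaluated.
        return [m for m in FAILURE_MODES if m not in self.findings]

assessment = RiskAssessment()
assessment.record("monoculture collapse", "high")
print(len(assessment.unassessed()))  # remaining modes still to assess
```

The value of even this trivial structure is the completeness check: `unassessed()` makes it explicit which of the six interaction-induced failure modes a given deployment review has not yet covered.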
Given fundamental limitations in our current understanding of LLM behaviour, our approach centres on analysis validity. It advocates progressively increasing validity through staged testing across levels of abstraction and deployment, gradually increasing exposure to potential negative impacts while collecting convergent evidence through simulation, observational analysis, benchmarking, and red teaming. This methodology establishes the groundwork for robust organisational risk management as LLM-based multi-agent systems are deployed and operated.
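The staged-validity idea can be sketched as a small escalation loop: deployment advances through stages of increasing realism and exposure, and each stage must produce convergent positive evidence from several methods before escalation continues. The stage names and the always-passing placeholder evaluation below are assumptions for illustration only, not the report's prescribed pipeline.

```python
# Stages of increasing exposure (illustrative names, not from the report).
STAGES = ["sandbox simulation", "shadow deployment", "limited pilot", "full deployment"]

# Evidence-gathering methods named in the abstract.
METHODS = ["simulation", "observational analysis", "benchmarking", "red teaming"]

def gather_evidence(stage: str) -> dict:
    # Placeholder: in practice each method would run real tests at this stage
    # and return a genuine pass/fail judgement.
    return {method: True for method in METHODS}

def progressive_verification(stages: list) -> list:
    """Advance through stages, halting escalation on any negative evidence."""
    completed = []
    for stage in stages:
        evidence = gather_evidence(stage)
        completed.append(stage)
        if not all(evidence.values()):
            # Do not increase exposure after a failure at the current stage.
            break
    return completed

print(progressive_verification(STAGES))
```

The design choice the loop captures is the one the abstract argues for: exposure to potential negative impacts only grows after convergent evidence from multiple methods supports it, so a failure at a low-exposure stage stops escalation before real-world harm is possible.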