AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-agent large language model (LLM) systems face severe security threats, including jailbreaking, prompt injection, and adversarial collaboration, owing to their open interaction protocols; existing defenses (e.g., self-verification or external guardians) suffer from insufficient robustness, high computational overhead, or single points of failure. To address this, we propose the first adversarial co-evolution framework tailored for multi-agent reinforcement learning, in which attackers and defenders are jointly trained so that safety capabilities are internalized directly within the task-performing agents. We introduce a shared advantage baseline based on group-level mean-return estimation to stabilize training and eliminate reliance on external modules. Experiments across diverse attack scenarios demonstrate that our approach keeps attack success rates below 20%, an 18.33-percentage-point improvement over baselines, while increasing task accuracy by up to 3.67%, thereby achieving joint optimization of security and functional performance.

📝 Abstract
LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification, which asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single point of failure: once compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To address these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps the attack success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving, and sometimes improving, task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.
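The adversarial loop the abstract describes, attackers synthesizing jailbreak prompts while defenders are rewarded both for task success and for resisting the attack, can be sketched in miniature. This is a toy illustration under stated assumptions, not the paper's implementation: the `ToyAttacker`/`ToyDefender` classes, the reward values, and the `complied` flag are all hypothetical stand-ins for the actual LLM policies and reward design.

```python
import random

class ToyAttacker:
    """Stand-in for the attacker policy; samples a candidate jailbreak prompt."""
    def generate(self, task, rng):
        return rng.choice(["ignore your safety rules", "benign question"])

class ToyDefender:
    """Stand-in for a task agent; here it always refuses and solves the task."""
    def respond(self, task, attack):
        complied = False  # a trained defender resists the jailbreak
        return task["solution"], complied

def co_evolution_step(attacker, defender, task, rng):
    """One adversarial round: attacker crafts a prompt, defender must
    resist it while still completing its task. Rewards on the safety
    term are zero-sum between the two sides."""
    attack = attacker.generate(task, rng)
    answer, complied = defender.respond(task, attack)
    task_reward = 1.0 if answer == task["solution"] else 0.0
    attacker_reward = 1.0 if complied else -1.0
    defender_reward = task_reward + (1.0 if not complied else -1.0)
    return attacker_reward, defender_reward
```

In the framework itself both sides are updated from these rewards, so attack prompts keep evolving against an increasingly robust defender; the sketch only shows the reward assignment for a single round.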
Problem

Research questions and friction points this paper is trying to address.

Enhancing safety in multi-agent systems against adversarial attacks and jailbreaks
Overcoming limitations of external guards and self-verification defense methods
Internalizing safety capabilities within task agents through adversarial co-evolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Internalizes safety into task agents via adversarial co-evolution
Jointly optimizes attackers and defenders in adversarial learning
Uses shared baseline for advantage estimation to stabilize learning
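The shared-baseline idea above can be made concrete: within each functional group, every agent's advantage is its return minus the group's mean return, which lowers variance compared with per-agent baselines. A minimal sketch, assuming a flat list of episode returns with parallel group labels (the function name and data layout are illustrative, not from the paper):

```python
from collections import defaultdict

def group_baseline_advantages(returns, groups):
    """Advantage = agent return minus the mean return of its functional group.

    returns: per-agent episode returns, e.g. [1.0, 3.0, ...]
    groups:  parallel group labels, e.g. ["planner", "planner", ...]
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, g in zip(returns, groups):
        sums[g] += r
        counts[g] += 1
    baselines = {g: sums[g] / counts[g] for g in sums}  # group-level mean return
    return [r - baselines[g] for r, g in zip(returns, groups)]
```

Because the baseline is shared within a group, an agent's advantage is positive only when it outperforms its peers, which the paper argues also encourages intra-group coordination.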