PEAR: Planner-Executor Agent Robustness Benchmark

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

career value

213K/year

🤖 AI Summary

Research on the robustness of LLM-driven planning-execution multi-agent systems (MAS) is fragmented and lacks systematic evaluation. Method: This paper introduces the first comprehensive benchmark specifically designed for this architecture, systematically injecting adversarial perturbations into both planning and execution modules within a multi-agent simulation framework to assess practicality and vulnerability across cleaning and attack scenarios. Contribution/Results: Key findings reveal that the planning module is significantly more vulnerable than the execution module; memory mechanisms confer no robustness improvement to actuators; and a strong trade-off exists between task performance and robustness. This work is the first to empirically identify differentiated vulnerability patterns across core MAS components, establishing an evidence-based foundation and a reproducible evaluation paradigm for designing secure, robust multi-agent systems.

Technology Category

Application Category

📝 Abstract

Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner-executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner-executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, having a memory module for the executor does not impact the clean task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.

Problem

Research questions and friction points this paper is trying to address.

Evaluating vulnerabilities in planner-executor multi-agent systems

Assessing trade-offs between task performance and system robustness

Analyzing effectiveness of adversarial attacks targeting planner components

Innovation

Methods, ideas, or system contributions that make the work stand out.

PEAR benchmark evaluates planner-executor multi-agent systems

Systematically tests both utility and vulnerability of MAS

Focuses on planner attacks and robustness trade-offs

🔎 Similar Papers

Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments