Auditing Agent Harness Safety

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Most current safety evaluations score only final outputs or terminal states, so they miss intermediate violations such as unauthorized resource access or information leakage during execution. This work proposes HarnessAudit, a framework for auditing complete execution trajectories of agent harnesses, with a focus on multi-agent systems. It evaluates trajectories along three dimensions: boundary compliance, execution fidelity, and system stability. Building on this framework, the authors introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, and evaluate ten harness configurations spanning frontier models and three multi-agent frameworks. The analysis shows that task completion is misaligned with safe execution, that violations accumulate with trajectory length, and that most violations concentrate in resource access and inter-agent information transfer. Multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.

📝 Abstract

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
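The gap between output-level and trajectory-level evaluation can be illustrated with a small sketch. The code below is not the HarnessAudit implementation; the `Step` record, the `audit` function, the permission table, and the clearance labels are all hypothetical names chosen for illustration. It shows only the general idea of checking every step of a trajectory against permission boundaries and information-flow constraints, rather than checking the final answer alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded harness event (hypothetical schema, not from the paper)."""
    agent: str
    action: str                      # "read", "write", or "send"
    target: str                      # resource name, or recipient agent for "send"
    labels: frozenset = frozenset()  # sensitivity labels carried by a message


def audit(trajectory, permissions, clearances):
    """Return (step_index, kind, detail) for each violation in the trajectory."""
    violations = []
    for i, step in enumerate(trajectory):
        if step.action in ("read", "write"):
            # Boundary check: was this (action, resource) pair granted to the agent?
            allowed = permissions.get(step.agent, set())
            if (step.action, step.target) not in allowed:
                detail = f"{step.agent} {step.action} {step.target}"
                violations.append((i, "boundary", detail))
        elif step.action == "send":
            # Information-flow check: is the recipient cleared for every label?
            leaked = step.labels - clearances.get(step.target, set())
            if leaked:
                detail = f"{step.agent} -> {step.target}: {sorted(leaked)}"
                violations.append((i, "info_flow", detail))
    return violations


# Assumed policy for a toy two-agent harness.
permissions = {
    "planner": {("read", "tickets")},
    "billing": {("read", "invoices")},
}
clearances = {"planner": set(), "billing": {"pii"}}

trajectory = [
    Step("planner", "read", "tickets"),
    Step("planner", "read", "invoices"),                      # not granted to planner
    Step("billing", "send", "planner", frozenset({"pii"})),   # planner lacks "pii" clearance
]

final_output_ok = True  # the final answer may still be correct and benign
violations = audit(trajectory, permissions, clearances)
```

In this toy run an output-only check would pass, while the per-step audit flags one boundary violation and one information-flow violation mid-trajectory. A real harness audit would need richer event schemas and policies than this sketch assumes.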

Problem

Research questions and friction points this paper is trying to address.

execution trajectory

safety audit

permission boundaries

information-flow constraints

multi-agent harnesses

Innovation

Methods, ideas, or system contributions that make the work stand out.

HarnessAudit

execution trajectory auditing

multi-agent safety