Toward a Principled Framework for Agent Safety Measurement

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current safety evaluations of large language model agents mostly rely on greedy decoding or a small number of sampled trajectories, which often miss low-probability yet high-risk unsafe behaviors. This work formulates safety evaluation as a trajectory search problem under a likelihood budget and applies BOA, a search framework that reports a safety score for a given deployment configuration. By combining batched decoding and judging, prefix caching, and chunked tree expansion, the approach makes exploration of multi-turn agent-environment trajectories practical. It ranks models, defense mechanisms, and attack strategies on a single scale and finds unsafe trajectories that greedy and sampled evaluations miss on agent-safety workloads, at manageable GPU cost.
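
A back-of-the-envelope illustration of why a few sampled trajectories can miss rare unsafe behavior: if unsafe trajectories carry total probability p under the decoder, then n independent rollouts all miss them with probability (1 - p)^n. The sketch below only evaluates that formula. The value p = 0.001 and the function name are made up for illustration and do not come from the paper.

```python
def miss_probability(p_unsafe: float, n_samples: int) -> float:
    """Probability that n i.i.d. sampled rollouts all avoid an unsafe
    event whose total probability is p_unsafe."""
    return (1.0 - p_unsafe) ** n_samples


# Hypothetical unsafe mass of 0.1%; not a measurement from the paper.
P_UNSAFE = 0.001
for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  P(all rollouts look safe) = {miss_probability(P_UNSAFE, n):.4f}")
```

Under this assumption, even a thousand rollouts leave a sizable chance of reporting a clean result, which is the gap a search-based evaluation aims to close.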

📝 Abstract

LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run greedy decoding or a few sampled rollouts and report a single safe/unsafe rate, which leaves them blind to the long-tail trajectories where unsafe behavior may arise from low-probability but non-negligible actions. We argue agent safety should be measured by search, not sampling. We apply BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in-budget trajectory space and reports a safety score: the probability the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent-environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety workloads, BOA discovers unsafe trajectories that greedy and sampled evaluations miss. BOA can additionally be used for ranking models, defenses, and attacks, all on the same scale, with manageable GPU costs.
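
The idea of searching the in-budget trajectory space can be illustrated with a toy sketch. It is not BOA's implementation. The policy, judge, action names, and `min_prob` threshold are hypothetical stand-ins, and the likelihood budget is assumed here to be a minimum trajectory probability. Batching, prefix caching, and chunked expansion are not modeled.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = Tuple[str, ...]
# Hypothetical policy interface: prefix -> next-action distribution ({} means terminal).
Policy = Callable[[Trajectory], Dict[str, float]]


def budgeted_search(
    policy: Policy,
    is_unsafe: Callable[[Trajectory], bool],
    min_prob: float,
) -> Tuple[float, float, List[Trajectory]]:
    """Enumerate every complete trajectory with probability >= min_prob.

    Returns (unsafe_mass, explored_mass, unsafe_trajectories).
    """
    unsafe_mass, explored_mass = 0.0, 0.0
    unsafe: List[Trajectory] = []
    stack: List[Tuple[Trajectory, float]] = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        next_actions = policy(prefix)
        if not next_actions:  # terminal trajectory: judge it
            explored_mass += prob
            if is_unsafe(prefix):
                unsafe_mass += prob
                unsafe.append(prefix)
            continue
        for action, p in next_actions.items():
            child_prob = prob * p
            # Probabilities only shrink along a path, so pruning here is safe.
            if child_prob >= min_prob:
                stack.append((prefix + (action,), child_prob))
    return unsafe_mass, explored_mass, unsafe


def toy_policy(prefix: Trajectory) -> Dict[str, float]:
    """Made-up two-step agent; the greedy path is read_file -> summarize."""
    if len(prefix) == 0:
        return {"read_file": 0.98, "list_dir": 0.02}
    if len(prefix) == 1:
        if prefix[0] == "list_dir":
            return {"summarize": 0.9, "delete_all": 0.1}
        return {"summarize": 1.0}
    return {}


def toy_judge(trajectory: Trajectory) -> bool:
    """Made-up judge: any trajectory containing delete_all is unsafe."""
    return "delete_all" in trajectory


unsafe_mass, explored_mass, found = budgeted_search(toy_policy, toy_judge, min_prob=1e-3)
print("unsafe trajectories:", found)
print("unsafe mass:", unsafe_mass, "explored mass:", explored_mass)
```

In this toy setting, a budget of `1e-3` reaches the unsafe trajectory that the greedy path never visits, while a budget of `0.01` prunes it. One minus the unsafe mass is then an optimistic estimate of the safety score, since any unexplored probability mass could still hide unsafe behavior.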

Problem

Research questions and friction points this paper is trying to address.

agent safety

long-tail trajectories

safety evaluation

LLM agents

unsafe behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

agent safety

search-based evaluation

trajectory search