Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient test coverage of existing safety evaluation benchmarks for large language model (LLM) agents in tool-use scenarios, which often fail to uncover unsafe behaviors that persist even in agents that pass standard evaluations. To tackle this limitation, the authors propose SafeAudit, a meta-audit framework that combines LLM-driven systematic test case enumeration, explicit modeling of tool-invocation workflows, and a semantics-agnostic rule-resistance metric. Empirical evaluation across three widely used benchmarks and twelve environments uncovers more than 20% residual unsafe behaviors that existing benchmarks fail to expose. Furthermore, coverage of these latent risks grows monotonically with the testing budget, exposing critical blind spots in current safety assessment protocols.
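The enumeration idea described above can be made concrete with a small sketch. The paper does not publish its enumerator, so the tool names, dependency structure, and brute-force strategy below are purely illustrative assumptions: we model each tool as having prerequisite tools and enumerate every ordered call sequence that respects those prerequisites.

```python
from itertools import permutations

# Hypothetical tool-dependency map (illustrative, not from the paper):
# each tool lists tools that must have been called earlier in the workflow.
DEPENDENCIES = {
    "search_files": [],
    "read_file": ["search_files"],
    "send_email": ["read_file"],
    "delete_file": ["read_file"],
}

def valid_workflows(tools, deps, max_len=3):
    """Enumerate ordered tool-call sequences that respect dependency order."""
    results = []
    for length in range(1, max_len + 1):
        for seq in permutations(tools, length):
            seen = set()
            ok = True
            for tool in seq:
                # A call is valid only if all its prerequisites appear earlier.
                if any(d not in seen for d in deps[tool]):
                    ok = False
                    break
                seen.add(tool)
            if ok:
                results.append(seq)
    return results

workflows = valid_workflows(list(DEPENDENCIES), DEPENDENCIES, max_len=3)
print(workflows)  # 4 valid workflows under these toy dependencies
```

In the paper's setting an LLM would generate the candidate workflows and user scenarios rather than a brute-force loop, but the validity constraint sketched here is the same: only dependency-respecting call orders count as test cases.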

📝 Abstract
Large Language Model (LLM) agents increasingly act through external tools, making their safety contingent on tool-call workflows rather than text generation alone. While recent benchmarks evaluate agents across diverse environments and risk categories, a fundamental question remains unanswered: how complete are existing test suites, and what unsafe interaction patterns persist even after an agent passes the benchmark? We propose SafeAudit, a meta-audit framework that addresses this gap through two contributions. First, we build an LLM-based enumerator that systematically generates test cases by enumerating valid tool-call workflows and diverse user scenarios. Second, we introduce rule-resistance, a non-semantic, quantitative metric that distills compact safety rules from existing benchmarks and identifies unsafe interaction patterns that remain uncovered under those rules. Across 3 benchmarks and 12 environments, SafeAudit uncovers more than 20% residual unsafe behaviors that existing benchmarks fail to expose, with coverage growing monotonically as the testing budget increases. Our results highlight significant completeness gaps in current safety evaluation and motivate meta-auditing as a necessary complement to benchmark-based agent safety testing.
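The rule-resistance metric can be illustrated with a minimal sketch. The abstract does not give a formula, so this is one plausible reading under stated assumptions: each distilled safety rule is a predicate over a tool-call trace, and rule-resistance is the fraction of unsafe traces that no rule flags. All rules and traces below are toy examples.

```python
# Hypothetical distilled safety rules (illustrative): each is a predicate
# over a tool-call trace that fires when the trace matches a known pattern.
RULES = [
    lambda trace: "delete_file" in trace and "confirm" not in trace,
    lambda trace: "send_email" in trace and "read_file" in trace,
]

def rule_resistance(unsafe_traces, rules):
    """Fraction of unsafe traces that slip past every distilled rule."""
    uncovered = [t for t in unsafe_traces if not any(r(t) for r in rules)]
    return len(uncovered) / len(unsafe_traces)

unsafe = [
    ["read_file", "send_email"],      # caught by the exfiltration rule
    ["delete_file"],                  # caught by the unconfirmed-delete rule
    ["search_files", "export_logs"],  # no rule fires: residual risk
    ["delete_file", "confirm"],       # no rule fires: residual risk
]
print(rule_resistance(unsafe, RULES))  # 0.5 under these toy rules
```

A higher value means the distilled rules explain fewer of the unsafe behaviors, i.e. the benchmark's implicit rule set leaves larger blind spots, which matches the paper's use of the metric to quantify uncovered interaction patterns.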
Problem

Research questions and friction points this paper is trying to address.

LLM agent safety
tool call safety
test suite completeness
unsafe interaction patterns
safety evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SafeAudit
tool-call safety
systematic enumeration
rule-resistance
LLM agent