🤖 AI Summary
This work addresses a critical limitation in existing agent evaluation methods, which focus solely on final-state compliance and thus fail to detect "latent failures": cases where agents bypass required policy checks yet reach a correct outcome by coincidence. To remedy this, the study introduces a process-oriented audit of agent tool-use trajectories, using the ToolGuard framework to compile natural-language policies into executable guard code and verify that tool-calling decisions adhere to the prescribed checks throughout execution. Experiments on the τ²-verified Airlines benchmark reveal that 8%–17% of trajectories involving state-modifying tool calls exhibit such latent failures, exposing a significant blind spot in current evaluation paradigms. These findings underscore both the necessity and the efficacy of assessing procedural compliance alongside outcome correctness.
📝 Abstract
Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks yet reach a correct outcome due to favorable circumstances. We refer to such cases as $\textit{near-misses}$ or $\textit{latent failures}$. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether an agent's tool-calling decisions were sufficiently informed.
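To make the idea concrete, the sketch below shows how a compiled guard can audit a trajectory for under-informed mutating calls. It is a minimal Python example under our own assumptions, not the actual ToolGuard implementation: the `ToolCall` structure, the tool names, and the `cancellation_guard` policy are hypothetical.

```python
# A minimal, hypothetical sketch of the kind of trajectory audit described
# above; it is NOT the ToolGuard API. Tool names, the guard's required
# checks, and the ToolCall structure are all illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict
    mutating: bool = False  # True if the call updates system state

def cancellation_guard(observed: set[str]) -> bool:
    """Hypothetical guard compiled from a policy such as
    'check the reservation and its fare rules before cancelling'."""
    required = {"get_reservation", "get_fare_rules"}
    return required <= observed  # informed only if all checks already ran

def audit(trajectory: list[ToolCall]) -> list[str]:
    """Flag mutating calls made without the prescribed prior checks,
    even when the final state happens to match the ground truth."""
    observed: set[str] = set()
    latent_failures: list[str] = []
    for call in trajectory:
        if call.mutating and not cancellation_guard(observed):
            latent_failures.append(call.name)
        observed.add(call.name)
    return latent_failures

# The agent cancels without ever consulting the fare rules; the outcome
# may still be correct, but the decision was insufficiently informed.
trace = [
    ToolCall("get_reservation", {"id": "R1"}),
    ToolCall("cancel_reservation", {"id": "R1"}, mutating=True),
]
print(audit(trace))  # ['cancel_reservation']
```

In this toy trace, a final-state comparison would pass whenever cancellation was in fact the correct action, while the trajectory audit still flags that the decision skipped a required check.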
We evaluate our approach on the $\tau^2$-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8–17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.